Variable rate speech coding

ABSTRACT

A method and apparatus for the variable rate coding of a speech signal. An input speech signal is classified and an appropriate coding mode is selected based on this classification. For each classification, the coding mode that achieves the lowest bit rate with an acceptable quality of speech reproduction is selected. Low average bit rates are achieved by employing high fidelity modes (i.e., high bit rate, broadly applicable to different types of speech) only during portions of the speech where this fidelity is required for acceptable output. Lower bit rate modes are used during portions of speech where these modes produce acceptable output. The input speech signal is classified into active and inactive regions. Active regions are further classified into voiced, unvoiced, and transient regions. Various coding modes are applied to active speech, depending upon the required level of fidelity. Coding modes may be utilized according to the strengths and weaknesses of each particular mode. The apparatus dynamically switches between these modes as the properties of the speech signal vary with time. Where appropriate, regions of speech are modeled as pseudo-random noise, resulting in a significantly lower bit rate. This coding is used in a dynamic fashion whenever unvoiced speech or background noise is detected.

BACKGROUND OF THE INVENTION

[0001] I. Field of the Invention

[0002] The present invention relates to the coding of speech signals. Specifically, the present invention relates to classifying speech signals and employing one of a plurality of coding modes based on the classification.

[0003] II. Description of the Related Art

[0004] Many communication systems today transmit voice as a digital signal, particularly long distance and digital radio telephone applications. The performance of these systems depends, in part, on accurately representing the voice signal with a minimum number of bits. Transmitting speech simply by sampling and digitizing requires a data rate on the order of 64 kilobits per second (kbps) to achieve the speech quality of a conventional analog telephone. However, coding techniques are available that significantly reduce the data rate required for satisfactory speech reproduction.

[0005] The term “vocoder” typically refers to devices that compress voiced speech by extracting parameters based on a model of human speech generation. Vocoders include an encoder and a decoder. The encoder analyzes the incoming speech and extracts the relevant parameters. The decoder synthesizes the speech using the parameters that it receives from the encoder via a transmission channel. The speech signal is often divided into frames of data and block processed by the vocoder.

[0006] Vocoders built around linear-prediction-based time domain coding schemes far exceed in number all other types of coders. These techniques extract correlated elements from the speech signal and encode only the uncorrelated elements. The basic linear predictive filter predicts the current sample as a linear combination of past samples. An example of a coding algorithm of this particular class is described in the paper “A 4.8 kbps Code Excited Linear Predictive Coder,” by Thomas E. Tremain et al., Proceedings of the Mobile Satellite Conference, 1988.

[0007] These coding schemes compress the digitized speech signal into a low bit rate signal by removing all of the natural redundancies (i.e., correlated elements) inherent in speech. Speech typically exhibits short term redundancies resulting from the mechanical action of the lips and tongue, and long term redundancies resulting from the vibration of the vocal cords. Linear predictive schemes model these operations as filters, remove the redundancies, and then model the resulting residual signal as white Gaussian noise. Linear predictive coders therefore achieve a reduced bit rate by transmitting filter coefficients and quantized noise rather than a full bandwidth speech signal.

[0008] However, even these reduced bit rates often exceed the available bandwidth where the speech signal must either propagate a long distance (e.g., ground to satellite) or coexist with many other signals in a crowded channel. A need therefore exists for an improved coding scheme which achieves a lower bit rate than linear predictive schemes.

SUMMARY OF THE INVENTION

[0009] The present invention is a novel and improved method and apparatus for the variable rate coding of a speech signal. The present invention classifies the input speech signal and selects an appropriate coding mode based on this classification. For each classification, the present invention selects the coding mode that achieves the lowest bit rate with an acceptable quality of speech reproduction. The present invention achieves low average bit rates by only employing high fidelity modes (i.e., high bit rate, broadly applicable to different types of speech) during portions of the speech where this fidelity is required for acceptable output. The present invention switches to lower bit rate modes during portions of speech where these modes produce acceptable output.

[0010] An advantage of the present invention is that speech is coded at a low bit rate. Low bit rates translate into higher capacity, greater range, and lower power requirements.

[0011] A feature of the present invention is that the input speech signal is classified into active and inactive regions. Active regions are further classified into voiced, unvoiced, and transient regions. The present invention therefore can apply various coding modes to different types of active speech, depending upon the required level of fidelity.

[0012] Another feature of the present invention is that coding modes may be utilized according to the strengths and weaknesses of each particular mode. The present invention dynamically switches between these modes as properties of the speech signal vary with time.

[0013] A further feature of the present invention is that, where appropriate, regions of speech are modeled as pseudo-random noise, resulting in a significantly lower bit rate. The present invention uses this coding in a dynamic fashion whenever unvoiced speech or background noise is detected.

[0014] The features, objects, and advantages of the present invention will become more apparent from the detailed description set forth below when taken in conjunction with the drawings in which like reference numbers indicate identical or functionally similar elements. Additionally, the left-most digit of a reference number identifies the drawing in which the reference number first appears.

BRIEF DESCRIPTION OF THE DRAWINGS

[0015] FIG. 1 is a diagram illustrating a signal transmission environment;

[0016] FIG. 2 is a diagram illustrating encoder 102 and decoder 104 in greater detail;

[0017] FIG. 3 is a flowchart illustrating variable rate speech coding according to the present invention;

[0018] FIG. 4A is a diagram illustrating a frame of voiced speech split into subframes;

[0019] FIG. 4B is a diagram illustrating a frame of unvoiced speech split into subframes;

[0020] FIG. 4C is a diagram illustrating a frame of transient speech split into subframes;

[0021] FIG. 5 is a flowchart that describes the calculation of initial parameters;

[0022] FIG. 6 is a flowchart describing the classification of speech as either active or inactive;

[0023] FIG. 7A depicts a CELP encoder;

[0024] FIG. 7B depicts a CELP decoder;

[0025] FIG. 8 depicts a pitch filter module;

[0026] FIG. 9A depicts a PPP encoder;

[0027] FIG. 9B depicts a PPP decoder;

[0028] FIG. 10 is a flowchart depicting the steps of PPP coding, including encoding and decoding;

[0029] FIG. 11 is a flowchart describing the extraction of a prototype residual period;

[0030] FIG. 12 depicts a prototype residual period extracted from the current frame of a residual signal, and the prototype residual period from the previous frame;

[0031] FIG. 13 is a flowchart depicting the calculation of rotational parameters;

[0032] FIG. 14 is a flowchart depicting the operation of the encoding codebook;

[0033] FIG. 15A depicts a first filter update module embodiment;

[0034] FIG. 15B depicts a first period interpolator module embodiment;

[0035] FIG. 16A depicts a second filter update module embodiment;

[0036] FIG. 16B depicts a second period interpolator module embodiment;

[0037] FIG. 17 is a flowchart describing the operation of the first filter update module embodiment;

[0038] FIG. 18 is a flowchart describing the operation of the second filter update module embodiment;

[0039] FIG. 19 is a flowchart describing the aligning and interpolating of prototype residual periods;

[0040] FIG. 20 is a flowchart describing the reconstruction of a speech signal based on prototype residual periods according to a first embodiment;

[0041] FIG. 21 is a flowchart describing the reconstruction of a speech signal based on prototype residual periods according to a second embodiment;

[0042] FIG. 22A depicts a NELP encoder;

[0043] FIG. 22B depicts a NELP decoder; and

[0044] FIG. 23 is a flowchart describing NELP coding.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

[0045] I. Overview of the Environment

[0046] II. Overview of the Invention

[0047] III. Initial Parameter Determination

[0048] A. Calculation of LPC Coefficients

[0049] B. LSI Calculation

[0050] C. NACF Calculation

[0051] D. Pitch Track and Lag Calculation

[0052] E. Calculation of Band Energy and Zero Crossing Rate

[0053] F. Calculation of the Formant Residual

[0054] IV. Active/Inactive Speech Classification

[0055] A. Hangover Frames

[0056] V. Classification of Active Speech Frames

[0057] VI. Encoder/Decoder Mode Selection

[0058] VII. Code Excited Linear Prediction (CELP) Coding Mode

[0059] A. Pitch Encoding Module

[0060] B. Encoding Codebook

[0061] C. CELP Decoder

[0062] D. Filter Update Module

[0063] VIII. Prototype Pitch Period (PPP) Coding Mode

[0064] A. Extraction Module

[0065] B. Rotational Correlator

[0066] C. Encoding Codebook

[0067] D. Filter Update Module

[0068] E. PPP Decoder

[0069] F. Period Interpolator

[0070] IX. Noise Excited Linear Prediction (NELP) Coding Mode

[0071] X. Conclusion

[0072] I. Overview of the Environment

[0073] The present invention is directed toward novel and improved methods and apparatuses for variable rate speech coding. FIG. 1 depicts a signal transmission environment 100 including an encoder 102, a decoder 104, and a transmission medium 106. Encoder 102 encodes a speech signal s(n), forming encoded speech signal s_(enc)(n), for transmission across transmission medium 106 to decoder 104. Decoder 104 decodes s_(enc)(n), thereby generating synthesized speech signal ŝ(n).

[0074] The term “coding” as used herein refers generally to methods encompassing both encoding and decoding. Generally, coding methods and apparatuses seek to minimize the number of bits transmitted via transmission medium 106 (i.e., minimize the bandwidth of s_(enc)(n)) while maintaining acceptable speech reproduction (i.e., ŝ(n)≈s(n)). The composition of the encoded speech signal will vary according to the particular speech coding method. Various encoders 102, decoders 104, and the coding methods according to which they operate are described below.

[0075] The components of encoder 102 and decoder 104 described below may be implemented as electronic hardware, as computer software, or combinations of both. These components are described below in terms of their functionality. Whether the functionality is implemented as hardware or software will depend upon the particular application and design constraints imposed on the overall system. Skilled artisans will recognize the interchangeability of hardware and software under these circumstances, and how best to implement the described functionality for each particular application.

[0076] Those skilled in the art will recognize that transmission medium 106 can represent many different transmission media, including, but not limited to, a land-based communication line, a link between a base station and a satellite, wireless communication between a cellular telephone and a base station, or between a cellular telephone and a satellite.

[0077] Those skilled in the art will also recognize that often each party to a communication transmits as well as receives. Each party would therefore require an encoder 102 and a decoder 104. However, signal transmission environment 100 will be described below as including encoder 102 at one end of transmission medium 106 and decoder 104 at the other. Skilled artisans will readily recognize how to extend these ideas to two-way communication.

[0078] For purposes of this description, assume that s(n) is a digital speech signal obtained during a typical conversation including different vocal sounds and periods of silence. The speech signal s(n) is preferably partitioned into frames, and each frame is further partitioned into subframes (preferably 4). These arbitrarily chosen frame/subframe boundaries are commonly used where some block processing is performed, as is the case here. Operations described as being performed on frames might also be performed on subframes; in this sense, frame and subframe are used interchangeably herein. However, s(n) need not be partitioned into frames/subframes at all if continuous processing rather than block processing is implemented. Skilled artisans will readily recognize how the block techniques described below might be extended to continuous processing.

[0079] In a preferred embodiment, s(n) is digitally sampled at 8 kHz. Each frame preferably contains 20 ms of data, or 160 samples at the preferred 8 kHz rate. Each subframe therefore contains 40 samples of data. It is important to note that many of the equations presented below assume these values. However, those skilled in the art will recognize that while these parameters are appropriate for speech coding, they are merely exemplary and other suitable alternative parameters could be used.
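
The following minimal Python sketch illustrates this framing arrangement (the 8 kHz rate, 160-sample frame, and 40-sample subframes are the preferred values given above; the function and variable names are illustrative, not part of the invention):

    import numpy as np

    SAMPLE_RATE = 8000     # 8 kHz sampling, per the preferred embodiment
    FRAME_SIZE = 160       # 20 ms frames
    SUBFRAME_SIZE = 40     # 4 subframes per frame

    def split_into_frames(s):
        """Partition a speech signal s(n) into 160-sample frames,
        each holding four 40-sample subframes."""
        n_frames = len(s) // FRAME_SIZE
        frames = np.reshape(s[:n_frames * FRAME_SIZE], (n_frames, FRAME_SIZE))
        # Each frame splits evenly into 4 subframes of 40 samples.
        subframes = frames.reshape(n_frames, FRAME_SIZE // SUBFRAME_SIZE, SUBFRAME_SIZE)
        return frames, subframes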

[0080] II. Overview of the Invention

[0081] The methods and apparatuses of the present invention involve coding the speech signal s(n). FIG. 2 depicts encoder 102 and decoder 104 in greater detail. According to the present invention, encoder 102 includes an initial parameter calculation module 202, a classification module 208, and one or more encoder modes 204. Decoder 104 includes one or more decoder modes 206. The number of decoder modes, N_(d), in general equals the number of encoder modes, N_(e). As would be apparent to one skilled in the art, encoder mode 1 communicates with decoder mode 1, and so on. As shown, the encoded speech signal, s_(enc)(n), is transmitted via transmission medium 106.

[0082] In a preferred embodiment, encoder 102 dynamically switches between multiple encoder modes from frame to frame, depending on which mode is most appropriate given the properties of s(n) for the current frame. Decoder 104 also dynamically switches between the corresponding decoder modes from frame to frame. A particular mode is chosen for each frame to achieve the lowest bit rate available while maintaining acceptable signal reproduction at the decoder. This process is referred to as variable rate speech coding, because the bit rate of the coder changes over time (as properties of the signal change).

[0083] FIG. 3 is a flowchart 300 that describes variable rate speech coding according to the present invention. In step 302, initial parameter calculation module 202 calculates various parameters based on the current frame of data. In a preferred embodiment, these parameters include one or more of the following: linear predictive coding (LPC) filter coefficients, line spectrum information (LSI) coefficients, the normalized autocorrelation functions (NACFs), the open loop lag, band energies, the zero crossing rate, and the formant residual signal.

[0084] In step 304, classification module 208 classifies the current frame as containing either “active” or “inactive” speech. As described above, s(n) is assumed to include both periods of speech and periods of silence, common to an ordinary conversation. Active speech includes spoken words, whereas inactive speech includes everything else, e.g., background noise, silence, pauses. The methods used to classify speech as active/inactive according to the present invention are described in detail below.

[0085] As shown in FIG. 3, step 306 considers whether the current frame was classified as active or inactive in step 304. If active, control flow proceeds to step 308. If inactive, control flow proceeds to step 310.

[0086] Those frames which are classified as active are further classified in step 308 as either voiced, unvoiced, or transient frames. Those skilled in the art will recognize that human speech can be classified in many different ways. Two conventional classifications of speech are voiced and unvoiced sounds. According to the present invention, all speech which is not voiced or unvoiced is classified as transient speech.

[0087] FIG. 4A depicts an example portion of s(n) including voiced speech 402. Voiced sounds are produced by forcing air through the glottis with the tension of the vocal cords adjusted so that they vibrate in a relaxed oscillation, thereby producing quasi-periodic pulses of air which excite the vocal tract. One common property measured in voiced speech is the pitch period, as shown in FIG. 4A.

[0088] FIG. 4B depicts an example portion of s(n) including unvoiced speech 404. Unvoiced sounds are generated by forming a constriction at some point in the vocal tract (usually toward the mouth end), and forcing air through the constriction at a high enough velocity to produce turbulence. The resulting unvoiced speech signal resembles colored noise.

[0089] FIG. 4C depicts an example portion of s(n) including transient speech 406 (i.e., speech which is neither voiced nor unvoiced). The example transient speech 406 shown in FIG. 4C might represent s(n) transitioning between unvoiced speech and voiced speech. Skilled artisans will recognize that many different classifications of speech could be employed according to the techniques described herein to achieve comparable results.

[0090] In step 310, an encoder/decoder mode is selected based on the frame classification made in steps 306 and 308. The various encoder/decoder modes are connected in parallel, as shown in FIG. 2. One or more of these modes can be operational at any given time. However, as described in detail below, only one mode preferably operates at any given time, and is selected according to the classification of the current frame.

[0091] Several encoder/decoder modes are described in the following sections. The different encoder/decoder modes operate according to different coding schemes. Certain modes are more effective at coding portions of the speech signal s(n) exhibiting certain properties.

[0092] In a preferred embodiment, a “Code Excited Linear Predictive” (CELP) mode is chosen to code frames classified as transient speech. The CELP mode excites a linear predictive vocal tract model with a quantized version of the linear prediction residual signal. Of all the encoder/decoder modes described herein, CELP generally produces the most accurate speech reproduction but requires the highest bit rate. In one embodiment, the CELP mode performs encoding at 8500 bits per second.

[0093] A “Prototype Pitch Period” (PPP) mode is preferably chosen to code frames classified as voiced speech. Voiced speech contains slowly time varying periodic components which are exploited by the PPP mode. The PPP mode codes only a subset of the pitch periods within each frame. The remaining periods of the speech signal are reconstructed by interpolating between these prototype periods. By exploiting the periodicity of voiced speech, PPP is able to achieve a lower bit rate than CELP and still reproduce the speech signal in a perceptually accurate manner. In one embodiment, the PPP mode performs encoding at 3900 bits per second.

[0094] A “Noise Excited Linear Predictive” (NELP) mode is chosen to code frames classified as unvoiced speech. NELP uses a filtered pseudo-random noise signal to model unvoiced speech. NELP uses the simplest model for the coded speech, and therefore achieves the lowest bit rate. In one embodiment, the NELP mode performs encoding at 1500 bits per second.

[0095] The same coding technique can frequently be operated at different bit rates, with varying levels of performance. The different encoder/decoder modes in FIG. 2 can therefore represent different coding techniques, or the same coding technique operating at different bit rates, or combinations of the above. Skilled artisans will recognize that increasing the number of encoder/decoder modes will allow greater flexibility when choosing a mode, which can result in a lower average bit rate, but will increase complexity within the overall system. The particular combination used in any given system will be dictated by the available system resources and the specific signal environment.
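
As a rough illustration of how mode switching lowers the average rate, the sketch below combines the per-mode rates given above (8500, 3900, and 1500 bps) with hypothetical frame-class fractions; the fractions are assumptions chosen only for this example:

    # Per-mode rates (bps) from the text; the frame-class fractions below
    # are hypothetical, chosen only to illustrate the averaging.
    RATES = {"CELP": 8500, "PPP": 3900, "NELP": 1500}
    fractions = {"CELP": 0.2, "PPP": 0.4, "NELP": 0.4}

    average_bps = sum(RATES[m] * fractions[m] for m in RATES)
    print(average_bps)   # 3860.0, well below coding every frame with CELP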

[0096] In step 312, the selected encoder mode 204 encodes the current frame and preferably packs the encoded data into data packets for transmission. In step 314, the corresponding decoder mode 206 unpacks the data packets, decodes the received data and reconstructs the speech signal. These operations are described in detail below with respect to the appropriate encoder/decoder modes.

[0097] III. Initial Parameter Determination

[0098] FIG. 5 is a flowchart describing step 302 in greater detail. Various initial parameters are calculated according to the present invention. The parameters preferably include, e.g., LPC coefficients, line spectrum information (LSI) coefficients, normalized autocorrelation functions (NACFs), open loop lag, band energies, zero crossing rate, and the formant residual signal. These parameters are used in various ways within the overall system, as described below.

[0099] In a preferred embodiment, initial parameter calculation module 202 uses a “look ahead” of 160+40 samples. This serves several purposes. First, the 160 sample look ahead allows a pitch frequency track to be computed using information in the next frame, which significantly improves the robustness of the voice coding and the pitch period estimation techniques, described below. Second, the 160 sample look ahead also allows the LPC coefficients, the frame energy, and the voice activity to be computed for one frame in the future. This allows for efficient, multi-frame quantization of the frame energy and LPC coefficients. Third, the additional 40 sample look ahead is for calculation of the LPC coefficients on Hamming windowed speech as described below. Thus the number of samples buffered before processing the current frame is 160+160+40, which includes the current frame and the 160+40 sample look ahead.
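
A minimal sketch of this buffering arrangement, assuming a simple shift-in of one 160-sample frame at a time (the class and method names are illustrative):

    import numpy as np

    FRAME = 160
    LOOKAHEAD = 160 + 40   # one future frame plus 40 samples for LPC windowing

    class SampleBuffer:
        """Holds the current frame plus the 160+40 sample look ahead
        described in the text (360 samples total)."""
        def __init__(self):
            self.buf = np.zeros(FRAME + LOOKAHEAD)

        def push_frame(self, new_samples):
            # Shift in one new 160-sample frame; the oldest frame
            # (the one just encoded) falls off the front.
            assert len(new_samples) == FRAME
            self.buf = np.concatenate([self.buf[FRAME:], new_samples])
            current = self.buf[:FRAME]    # frame now being encoded
            future = self.buf[FRAME:]     # 200-sample look ahead
            return current, future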

[0100] A. Calculation of LPC Coefficients

[0101] The present invention utilizes an LPC prediction error filter to remove the short term redundancies in the speech signal. The transfer function for the LPC filter is: $A(z) = 1 - \sum_{i=1}^{10} a_i z^{-i}$

[0102] The present invention preferably implements a tenth-order filter, as shown in the previous equation. An LPC synthesis filter in the decoder reinserts the redundancies, and is given by the inverse of A(z): $\frac{1}{A(z)} = \frac{1}{1 - \sum_{i=1}^{10} a_i z^{-i}}$

[0103] In step 502, the LPC coefficients, a_(i), are computed from s(n) as follows. The LPC parameters are preferably computed for the next frame during the encoding procedure for the current frame.

[0104] A Hamming window is applied to the current frame centered between the 119^(th) and 120^(th) samples (assuming the preferred 160 sample frame with a “look ahead”). The windowed speech signal, s_(w)(n), is given by: $s_w(n) = s(n+40)\left(0.5 + 0.46\cos\left(\pi\frac{n-79.5}{80}\right)\right), \quad 0 \leq n < 160$

[0105] The offset of 40 samples results in the window of speech being centered between the 119^(th) and 120^(th) sample of the preferred 160 sample frame of speech.

[0106] Eleven autocorrelation values are preferably computed as $R(k) = \sum_{m=0}^{159-k} s_w(m)\,s_w(m+k), \quad 0 \leq k \leq 10$

[0107] The autocorrelation values are windowed to reduce the probability of missing roots of line spectral pairs (LSPs) obtained from the LPC coefficients, as given by:

R(k)=h(k)R(k), 0≦k≦10

[0108] resulting in a slight bandwidth expansion, e.g., 25 Hz. The values h(k) are preferably taken from the center of a 255 point Hamming window.

[0109] The LPC coefficients are then obtained from the windowed autocorrelation values using Durbin's recursion. Durbin's recursion, a well known efficient computational method, is discussed in the text Digital Processing of Speech Signals by Rabiner & Schafer.
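
The sketch below shows a generic Levinson-Durbin recursion of the kind referenced here, assuming the eleven windowed autocorrelation values R(0)..R(10) have already been computed; it is a textbook implementation, not code from the patent:

    import numpy as np

    def durbin(R, order=10):
        """Levinson-Durbin recursion: autocorrelations R[0..order] ->
        LPC coefficients a_1..a_order for A(z) = 1 - sum a_i z^-i."""
        R = np.asarray(R, dtype=float)
        E = R[0]
        a = np.zeros(order + 1)
        for i in range(1, order + 1):
            # Reflection coefficient from the residual prediction error.
            k = (R[i] - np.dot(a[1:i], R[i-1:0:-1])) / E
            a_prev = a.copy()
            a[i] = k
            a[1:i] = a_prev[1:i] - k * a_prev[i-1:0:-1]
            E *= (1.0 - k * k)
        return a[1:]    # a_1 ... a_10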

[0110] B. LSI Calculation

[0111] In step 504, the LPC coefficients are transformed into line spectrum information (LSI) coefficients for quantization and interpolation. The LSI coefficients are computed according to the present invention in the following manner.

[0112] As before, A(z) is given by

A(z)=1−a₁z⁻¹− . . . −a₁₀z⁻¹⁰,

[0113] where a_(i) are the LPC coefficients, and 1≦i≦10.

[0114] P_(A)(z) and Q_(A)(z) are defined as follows:

P_(A)(z)=A(z)+z⁻¹¹A(z⁻¹)=p₀+p₁z⁻¹+ . . . +p₁₁z⁻¹¹,

Q_(A)(z)=A(z)−z⁻¹¹A(z⁻¹)=q₀+q₁z⁻¹+ . . . +q₁₁z⁻¹¹,

[0115] where

p_(i)=−a_(i)−a_(11−i), 1≦i≦10

q_(i)=−a_(i)+a_(11−i), 1≦i≦10

[0116] and

p₀=1, p₁₁=1

q₀=1, q₁₁=−1

[0117] The line spectral cosines (LSCs) are the ten roots in −1.0<x<1.0 of the following two functions:

P′(x)=p′₀ cos(5 cos⁻¹(x))+p′₁ cos(4 cos⁻¹(x))+ . . . +p′₄x+p′₅/2

Q′(x)=q′₀ cos(5 cos⁻¹(x))+q′₁ cos(4 cos⁻¹(x))+ . . . +q′₄x+q′₅/2

[0118] where

[0119] p′₀=1

[0120] q′₀=1

[0121] p′_(i)=p_(i)−p′_(i−1), 1≦i≦5

[0122] q′_(i)=q_(i)+q′_(i−1), 1≦i≦5

[0123] The LSI coefficients are then calculated as: $lsi_i = \begin{cases} 0.5\sqrt{1 - lsc_i} & lsc_i \geq 0 \\ 1.0 - 0.5\sqrt{1 + lsc_i} & lsc_i < 0 \end{cases}$

[0124] The LSCs can be obtained back from the LSI coefficients according to: $lsc_i = \begin{cases} 1.0 - 4\,lsi_i^2 & lsi_i \leq 0.5 \\ 4(1 - lsi_i)^2 - 1.0 & lsi_i > 0.5 \end{cases}$

[0125] The stability of the LPC filter guarantees that the roots of the two functions alternate, i.e., the smallest root, lsc₁, is the smallest root of P′(x), the next smallest root, lsc₂, is the smallest root of Q′(x), etc. Thus, lsc₁, lsc₃, lsc₅, lsc₇, and lsc₉ are the roots of P′(x), and lsc₂, lsc₄, lsc₆, lsc₈, and lsc₁₀ are the roots of Q′(x).
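
A small sketch of the two mappings above (the lsi > 0.5 branch of the inverse follows the reconstructed equation; the round trip can be checked numerically):

    import numpy as np

    def lsc_to_lsi(lsc):
        """Map line spectral cosines to LSI coefficients (equation above)."""
        lsc = np.asarray(lsc)
        return np.where(lsc >= 0,
                        0.5 * np.sqrt(1.0 - lsc),
                        1.0 - 0.5 * np.sqrt(1.0 + lsc))

    def lsi_to_lsc(lsi):
        """Inverse mapping back to line spectral cosines."""
        lsi = np.asarray(lsi)
        return np.where(lsi <= 0.5,
                        1.0 - 4.0 * lsi ** 2,
                        4.0 * (1.0 - lsi) ** 2 - 1.0)

    # Round-trip check: lsc = -0.5 -> lsi ~= 0.6464 -> lsc = -0.5
    assert np.allclose(lsi_to_lsc(lsc_to_lsi([-0.5, 0.3])), [-0.5, 0.3])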

[0126] Those skilled in the art will recognize that it is preferable to employ some method for computing the sensitivity of the LSI coefficients to quantization. “Sensitivity weightings” can be used in the quantization process to appropriately weight the quantization error in each LSI.

[0127] The LSI coefficients are quantized using a multistage vector quantizer (VQ). The number of stages preferably depends on the particular bit rate and codebooks employed. The codebooks are chosen based on whether or not the current frame is voiced.

[0128] The vector quantization minimizes a weighted-mean-squared error (WMSE) which is defined as $E(\vec{x}, \vec{y}) = \sum_{i=0}^{P-1} \left( w_i (x_i - y_i) \right)^2$

[0129] where $\vec{x}$ is the vector to be quantized, $\vec{w}$ the weight associated with it, and $\vec{y}$ is the codevector. In a preferred embodiment, $\vec{w}$ are sensitivity weightings and P=10.

[0130] The LSI vector is reconstructed from the LSI codes obtained by way of quantization as $\vec{qlsi} = \sum_{i=1}^{N} \vec{CB}^{i}_{code_i}$

[0131] where CB^(i) is the i^(th) stage VQ codebook for either voiced or unvoiced frames (this is based on the code indicating the choice of the codebook) and code_(i) is the LSI code for the i^(th) stage.

[0132] Before the LSI coefficients are transformed to LPC coefficients, a stability check is performed to ensure that the resulting LPC filters have not been made unstable due to quantization noise or channel errors injecting noise into the LSI coefficients. Stability is guaranteed if the LSI coefficients remain ordered.
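
A minimal sketch of such an ordering check (the sort-based repair is one plausible remedy, not a method prescribed by the text):

    def lsi_stable(lsi):
        """Ordering test described above: stability of the reconstructed
        LPC filter requires the LSI coefficients to remain ordered."""
        return all(lsi[i] < lsi[i + 1] for i in range(len(lsi) - 1))

    def enforce_order(lsi):
        # One plausible repair (an assumption of this sketch): re-sort
        # coefficients mildly disordered by quantization or channel noise.
        return sorted(lsi)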

[0133] In calculating the original LPC coefficients, a speech window centered between the 119^(th) and 120^(th) sample of the frame was used. The LPC coefficients for other points in the frame are approximated by interpolating between the previous frame's LSCs and the current frame's LSCs. The resulting interpolated LSCs are then converted back into LPC coefficients. The exact interpolation used for each subframe is given by:

ilsc _(j)=(1−α_(i))lscprev _(j)+α_(i) lsccurr _(j), 1≦j≦10

[0134] where α_(i) are the interpolation factors 0.375, 0.625, 0.875, 1.000 for the four subframes of 40 samples each and ilsc are the interpolated LSCs. $\hat{P}_A(z)$ and $\hat{Q}_A(z)$ are computed from the interpolated LSCs as $\hat{P}_A(z) = (1 + z^{-1})\prod_{j=1}^{5}\left(1 - 2\,ilsc_{2j-1}\,z^{-1} + z^{-2}\right)$ $\hat{Q}_A(z) = (1 - z^{-1})\prod_{j=1}^{5}\left(1 - 2\,ilsc_{2j}\,z^{-1} + z^{-2}\right)$

[0135] The interpolated LPC coefficients for all four subframes are computed as coefficients of $\hat{A}(z) = \frac{\hat{P}_A(z) + \hat{Q}_A(z)}{2}$ Thus, $\hat{a}_i = \begin{cases} -\frac{\hat{p}_i + \hat{q}_i}{2} & 1 \leq i \leq 5 \\ -\frac{\hat{p}_{11-i} - \hat{q}_{11-i}}{2} & 6 \leq i \leq 10 \end{cases}$

[0136] C. NACF Calculation

[0137] In step 506, the normalized autocorrelation functions (NACFs) are calculated according to the current invention.

[0138] The formant residual for the next frame is computed over four 40 sample subframes as $r(n) = s(n) - \sum_{i=1}^{10} \tilde{a}_i\, s(n-i)$

[0139] where ã_(i) is the i^(th) interpolated LPC coefficient of the corresponding subframe, where the interpolation is done between the current frame's unquantized LSCs and the next frame's LSCs. The next frame's energy is also computed as $E_N = 0.5\log_2\left( \frac{\sum_{n=0}^{159} r^2(n)}{160} \right)$

[0140] The residual calculated above is low pass filtered and decimated, preferably using a zero phase FIR filter of length 15, the coefficients of which, df_(i), −7≦i≦7, are {0.0800, 0.1256, 0.2532, 0.4376, 0.6424, 0.8268, 0.9544, 1.000, 0.9544, 0.8268, 0.6424, 0.4376, 0.2532, 0.1256, 0.0800}. The low pass filtered, decimated residual is computed as $r_d(n) = \sum_{i=-7}^{7} df_i\, r(Fn + i), \quad 0 \leq n < 160/F$

[0141] where F=2 is the decimation factor, and r(Fn+i), −7≦Fn+i≦6, are obtained from the last 14 values of the current frame's residual based on unquantized LPC coefficients. As mentioned above, these LPC coefficients are computed and stored during the previous frame.

[0142] The NACFs for two subframes (40 samples decimated) of the next frame are calculated as follows:

$Exx_k = \sum_{i=0}^{39} r_d(40k+i)\, r_d(40k+i), \quad k = 0, 1$

$Exy_{k,j} = \sum_{i=0}^{39} r_d(40k+i)\, r_d(40k+i-j), \quad 12/2 \leq j < 128/2, \quad k = 0, 1$

$Eyy_{k,j} = \sum_{i=0}^{39} r_d(40k+i-j)\, r_d(40k+i-j), \quad 12/2 \leq j < 128/2, \quad k = 0, 1$

$n\_corr_{k,\,j-12/2} = \frac{(Exy_{k,j})^2}{Exx_k\, Eyy_{k,j}}, \quad 12/2 \leq j < 128/2, \quad k = 0, 1$

[0143] For r_(d)(n) with negative n, the current frame's low-pass filtered and decimated residual (stored during the previous frame) is used. The NACFs for the current subframe, c_corr, were also computed and stored during the previous frame.
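
A sketch of this NACF computation, assuming the decimated residual history needed for the negative indices (at least 63 samples) is supplied by the caller:

    import numpy as np

    def nacfs(rd_hist, rd_curr, k):
        """n_corr for decimated subframe k (k = 0 or 1) of the next frame,
        per the Exx/Exy/Eyy definitions above. rd_hist must hold at least
        63 past decimated samples so negative indices resolve."""
        rd = np.concatenate([rd_hist, rd_curr])
        off = len(rd_hist)                      # rd[off + n] == r_d(n)
        x = rd[off + 40*k : off + 40*k + 40]    # r_d(40k + i), i = 0..39
        Exx = np.dot(x, x)
        out = []
        for j in range(6, 64):                  # lags 12/2 <= j < 128/2
            y = rd[off + 40*k - j : off + 40*k - j + 40]
            Exy = np.dot(x, y)
            out.append(Exy * Exy / (Exx * np.dot(y, y)))
        return np.array(out)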

[0144] D. Pitch Track and Lag Calculation

[0145] In step 508, the pitch track and pitch lag are computed according to the present invention. The pitch lag is preferably calculated using a Viterbi-like search with a backward track as follows.

$R1_i = n\_corr_{0,i} + \max\{ n\_corr_{1,\,j+FAN_{i,0}} \}, \quad 0 \leq i < 116/2, \quad 0 \leq j < FAN_{i,1}$

$R2_i = c\_corr_{1,i} + \max\{ R1_{j+FAN_{i,0}} \}, \quad 0 \leq i < 116/2, \quad 0 \leq j < FAN_{i,1}$

$RM_{2i} = R2_i + \max\{ c\_corr_{0,\,j+FAN_{i,0}} \}, \quad 0 \leq i < 116/2, \quad 0 \leq j < FAN_{i,1}$

[0146] where FAN_(i,j) is the 2×58 matrix, {{0, 2}, {0, 3}, {2, 2}, {2, 3}, {2, 4}, {3, 4}, {4, 4}, {5, 4}, {5, 5}, {6, 5}, {7, 5}, {8, 6}, {9, 6}, {10, 6}, {11, 6}, {11, 7}, {12, 7}, {13, 7}, {14, 8}, {15, 8}, {16, 8}, {16, 9}, {17, 9}, {18, 9}, {19, 9}, {20, 10}, {21, 10}, {22, 10}, {22, 11}, {23, 11}, {24, 11}, {25, 12}, {26, 12}, {27, 12}, {28, 12}, {28, 13}, {29, 13}, {30, 13}, {31, 14}, {32, 14}, {33, 14}, {33, 15}, {34, 15}, {35, 15}, {36, 15}, {37, 16}, {38, 16}, {39, 16}, {39, 17}, {40, 17}, {41, 16}, {42, 16}, {43, 15}, {44, 14}, {45, 13}, {45, 13}, {46, 12}, {47, 11}}. The vector RM_(2i) is interpolated to get values for RM_(2i+1) as

$RM_{2i+1} = \sum_{j=0}^{3} cf_j\, RM_{2(i-1+j)}, \quad 1 \leq i < 112/2$

$RM_1 = (RM_0 + RM_2)/2$

$RM_{2\cdot56+1} = (RM_{2\cdot56} + RM_{2\cdot57})/2$

$RM_{2\cdot57+1} = RM_{2\cdot57}$

[0147] where cf_(j) is the interpolation filter whose coefficients are {−0.0625, 0.5625, 0.5625, −0.0625}. The lag L_(C) is then chosen such that $R_{L_C-12} = \max\{R_i\},\ 4 \leq i \leq 116$, and the current frame's NACF is set equal to $R_{L_C-12}/4$. Lag multiples are then removed by searching for the lag corresponding to the maximum correlation greater than $0.9\,R_{L_C-12}$ amidst:

$R_{\max\{\lfloor L_C/M \rfloor - 14,\ 16\}}\ \ldots\ R_{\lfloor L_C/M \rfloor - 10}$, for all $1 \leq M \leq \lfloor L_C/16 \rfloor$.

[0148] E. Calculation of Band Energy and Zero Crossing Rate

[0149] In step 510, energies in the 0-2 kHz band and the 2-4 kHz band are computed according to the present invention as

$E_L = \sum_{n=0}^{159} s_L^2(n)$

$E_H = \sum_{n=0}^{159} s_H^2(n)$

where

$S_L(z) = S(z)\,\frac{bl_0 + \sum_{i=1}^{15} bl_i z^{-i}}{al_0 + \sum_{i=1}^{15} al_i z^{-i}}$

$S_H(z) = S(z)\,\frac{bh_0 + \sum_{i=1}^{15} bh_i z^{-i}}{ah_0 + \sum_{i=1}^{15} ah_i z^{-i}}$

[0150] S(z), S_(L)(z) and S_(H)(z) being the z-transforms of the input speech signal s(n), low-pass signal s_(L)(n) and high-pass signal s_(H)(n), respectively, bl={0.0003, 0.0048, 0.0333, 0.1443, 0.4329, 0.9524, 1.5873, 2.0409, 2.0409, 1.5873, 0.9524, 0.4329, 0.1443, 0.0333, 0.0048, 0.0003}, al={1.0, 0.9155, 2.4074, 1.6511, 2.0597, 1.0584, 0.7976, 0.3020, 0.1465, 0.0394, 0.0122, 0.0021, 0.0004, 0.0, 0.0, 0.0}, bh={0.0013, −0.0189, 0.1324, −0.5737, 1.7212, −3.7867, 6.3112, −8.1144, 8.1144, −6.3112, 3.7867, −1.7212, 0.5737, −0.1324, 0.0189, −0.0013} and ah={1.0, −2.8818, 5.7550, −7.7730, 8.2419, −6.8372, 4.6171, −2.5257, 1.1296, −0.4084, 0.1183, −0.0268, 0.0046, −0.0006, 0.0, 0.0}.

[0151] The speech signal energy itself is $E = \sum_{n=0}^{159} s^2(n)$.

[0152] The zero crossing rate ZCR is computed as

if (s(n)s(n+1)<0) ZCR=ZCR+1, 0≦n<159
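
Both quantities are straightforward to compute; a minimal sketch over one 160-sample frame (function names are illustrative):

    import numpy as np

    def frame_energy(s):
        """Total energy of the 160-sample frame (equation above)."""
        return float(np.dot(s, s))

    def zero_crossing_rate(s):
        """Count sign changes between successive samples, per the rule above."""
        return int(np.sum(s[:-1] * s[1:] < 0))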

[0153] F. Calculation of the Formant Residual

[0154] In step 512, the formant residual for the current frame is computed over four subframes as $r_{curr}(n) = s(n) - \sum_{i=1}^{10} \hat{a}_i\, s(n-i)$

[0155] where â_(i) is the i^(th) LPC coefficient of the corresponding subframe.
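
A direct sketch of this prediction error filtering for one 40-sample subframe, assuming the caller supplies the last ten samples of the preceding subframe as history:

    import numpy as np

    def formant_residual(s, a, history):
        """r(n) = s(n) - sum_{i=1..10} a_i s(n-i) over one subframe.
        `history` carries the last 10 samples of the preceding subframe;
        a = [a_1, ..., a_10]."""
        ext = np.concatenate([history, s])
        r = np.empty(len(s))
        for n in range(len(s)):
            past = ext[n : n + 10][::-1]   # s(n-1) ... s(n-10)
            r[n] = ext[n + 10] - np.dot(a, past)
        return r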

[0156] IV. Active/Inactive Speech Classification

[0157] Referring back to FIG. 3, in step 304, the current frame is classified as either active speech (e.g., spoken words) or inactive speech (e.g., background noise, silence). FIG. 6 is a flowchart 600 that depicts step 304 in greater detail. In a preferred embodiment, a two energy band based thresholding scheme is used to determine if active speech is present. The lower band (band 0) spans frequencies from 0.1-2.0 kHz and the upper band (band 1) from 2.0-4.0 kHz. Voice activity detection is preferably determined for the next frame during the encoding procedure for the current frame, in the following manner.

[0158] In step 602, the band energies E_(b)(i) for bands i=0, 1 are computed. The autocorrelation sequence, as described above in Section III.A., is extended to 19 using the following recursive equation: $R(k) = \sum_{i=1}^{10} a_i\, R(k-i), \quad 11 \leq k \leq 19$

[0159] Using this equation, R(11) is computed from R(1) to R(10), R(12) is computed from R(2) to R(11), and so on. The band energies are then computed from the extended autocorrelation sequence using the following equation: $E_b(i) = \log_2\left( R(0)R_h(i)(0) + 2\sum_{k=1}^{19} R(k)R_h(i)(k) \right), \quad i = 0, 1$

[0160] where R(k) is the extended autocorrelation sequence for the current frame and R_(h)(i)(k) is the band filter autocorrelation sequence for band i given in Table 1.

TABLE 1
Filter Autocorrelation Sequences for Band Energy Calculations

 k    R_(h)(0)(k) band 0    R_(h)(1)(k) band 1
 0     4.230889E-01          4.042770E-01
 1     2.693014E-01         −2.503076E-01
 2    −1.124000E-02         −3.059308E-02
 3    −1.301279E-01          1.497124E-01
 4    −5.949044E-02         −7.905954E-02
 5     1.494007E-02          4.371288E-03
 6    −2.087666E-03         −2.088545E-02
 7    −3.823536E-02          5.622753E-02
 8    −2.748034E-02         −4.420598E-02
 9     3.015699E-04          1.443167E-02
10     3.722060E-03         −8.462525E-03
11    −6.416949E-03          1.627144E-02
12    −6.551736E-03         −1.476080E-02
13     5.493820E-04          6.187041E-03
14     2.934550E-03         −1.898632E-03
15     8.041829E-04          2.053577E-03
16    −2.857628E-04         −1.860064E-03
17     2.585250E-04          7.729618E-04
18     4.816371E-04         −2.297862E-04
19     1.692738E-04          2.107964E-04
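
A sketch of the autocorrelation extension and band energy computation, with one of the two Table 1 rows passed in as Rh_band (the argument to log2 is assumed positive, as it will be for the tabulated sequences on real speech):

    import numpy as np

    def extend_autocorr(R, a, kmax=19):
        """Extend R(k) recursively to k = 19 using the LPC coefficients,
        per the equation above: R(k) = sum_{i=1..10} a_i R(k-i)."""
        R = list(R)                           # R[0..10] from Section III.A
        for k in range(11, kmax + 1):
            R.append(sum(a[i - 1] * R[k - i] for i in range(1, 11)))
        return np.array(R)

    def band_energy(R_ext, Rh_band):
        """E_b(i) = log2( R(0)Rh(0) + 2 * sum_{k=1..19} R(k)Rh(k) ),
        with Rh_band one of the two rows of Table 1."""
        acc = R_ext[0] * Rh_band[0] + 2.0 * np.dot(R_ext[1:20], Rh_band[1:20])
        return np.log2(acc)   # assumes acc > 0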

[0161] In step 604, the band energy estimates are smoothed. The smoothed band energy estimates, E_(sm)(i), are updated for each frame using the following equation.

E _(sm)(i)=0.6E _(sm)(i)+0.4E _(b)(i), i=0, 1

[0162] In step 606, signal energy and noise energy estimates are updated. The signal energy estimates, E_(s)(i), are preferably updated using the following equation:

E _(s)(i)=max(E _(sm)(i), E _(s)(i)), i=0, 1

[0163] The noise energy estimates, E_(n)(i), are preferably updated using the following equation:

E _(n)(i)=min(E _(sm)(i), E _(n)(i)), i=0, 1

[0164] In step 608, the long term signal-to-noise ratios for the two bands, SNR(i), are computed as

SNR(i)=E _(s)(i)−E _(n)(i), i=0, 1

[0165] In step 610, these SNR values are preferably divided into eight regions Reg_(SNR)(i) defined as $Reg_{SNR}(i) = \begin{cases} 0 & 0.6\,SNR(i) - 4 < 0 \\ \mathrm{round}(0.6\,SNR(i) - 4) & 0 \leq 0.6\,SNR(i) - 4 < 7 \\ 7 & 0.6\,SNR(i) - 4 \geq 7 \end{cases}$

[0166] In step 612, the voice activity decision is made in the following manner according to the current invention. If either E_(b)(0)−E_(n)(0)>THRESH(Reg_(SNR)(0)), or E_(b)(1)−E_(n)(1)>THRESH(Reg_(SNR)(1)), then the frame of speech is declared active. Otherwise, the frame of speech is declared inactive. The values of THRESH are defined in Table 2.

TABLE 2
Threshold Factors as a Function of the SNR Region

SNR Region    THRESH
0             2.807
1             2.807
2             3.000
3             3.104
4             3.154
5             3.233
6             3.459
7             3.982

[0167] The signal energy estimates, E_(s)(i), are preferably updated using the following equation:

E _(s)(i)=E _(s)(i)−0.014499, i=0, 1.

[0168] The noise energy estimates, E_(n)(i), are preferably updated using the following equation: $E_n(i) = \begin{cases} 4 & E_n(i) + 0.0066 < 4 \\ 23 & 23 < E_n(i) + 0.0066 \\ E_n(i) + 0.0066 & \text{otherwise} \end{cases}, \quad i = 0, 1$
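
The sketch below strings steps 604 through 612 together, including the post-decision drift of the estimates; the dictionary-based state carrier is an assumption of the sketch, and Table 2 supplies the thresholds:

    THRESH = [2.807, 2.807, 3.000, 3.104, 3.154, 3.233, 3.459, 3.982]  # Table 2

    def vad_update(Eb, state):
        """One pass of steps 604-612 for the two bands. `state` carries the
        smoothed (Esm), signal (Es), and noise (En) estimates across frames."""
        active = False
        for i in range(2):
            state['Esm'][i] = 0.6 * state['Esm'][i] + 0.4 * Eb[i]     # step 604
            state['Es'][i] = max(state['Esm'][i], state['Es'][i])     # step 606
            state['En'][i] = min(state['Esm'][i], state['En'][i])
            snr = state['Es'][i] - state['En'][i]                     # step 608
            region = int(min(max(round(0.6 * snr - 4), 0), 7))        # step 610
            if Eb[i] - state['En'][i] > THRESH[region]:               # step 612
                active = True
            # Post-decision drift of the estimates, per the equations above:
            state['Es'][i] -= 0.014499
            state['En'][i] = min(max(state['En'][i] + 0.0066, 4), 23)
        return active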

[0169] A. Hangover Frames

[0170] When signal-to-noise ratios are low, “hangover” frames are preferably added to improve the quality of the reconstructed speech. If the three previous frames were classified as active, and the current frame is classified inactive, then the next M frames including the current frame are classified as active speech. The number of hangover frames, M, is preferably determined as a function of SNR(0) as defined in Table 3.

TABLE 3
Hangover Frames as a Function of SNR(0)

SNR(0)    M
0         4
1         3
2         3
3         3
4         3
5         3
6         3
7         3
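
A sketch of this hangover rule, with Table 3 supplying M; the bookkeeping (a short history of raw decisions and a countdown of remaining hangover frames) is an assumed implementation detail:

    HANGOVER_M = [4, 3, 3, 3, 3, 3, 3, 3]   # Table 3, indexed by SNR(0) region

    def apply_hangover(raw_active, history, snr0_region, hangover_left):
        """history: raw (pre-hangover) decisions for previous frames, newest
        last, assumed to hold at least three entries. Returns the final
        decision and the remaining hangover count."""
        if raw_active:
            return True, 0
        if len(history) >= 3 and all(history[-3:]):
            # Three active frames followed by an inactive one: keep the
            # next M frames (including this one) active.
            hangover_left = HANGOVER_M[snr0_region]
        if hangover_left > 0:
            return True, hangover_left - 1
        return False, 0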

[0171] V. Classification of Active Speech Frames

[0172] Referring back to FIG. 3, in step 308, current frames which were classified as being active in step 304 are further classified according to properties exhibited by the speech signal s(n). In a preferred embodiment, active speech is classified as either voiced, unvoiced, or transient. The degree of periodicity exhibited by the active speech signal determines how it is classified. Voiced speech exhibits the highest degree of periodicity (quasi-periodic in nature). Unvoiced speech exhibits little or no periodicity. Transient speech exhibits degrees of periodicity between voiced and unvoiced.

[0173] However, the general framework described herein is not limited to the preferred classification scheme and the specific encoder/decoder modes described below. Active speech can be classified in alternative ways, and alternative encoder/decoder modes are available for coding. Those skilled in the art will recognize that many combinations of classifications and encoder/decoder modes are possible. Many such combinations can result in a reduced average bit rate according to the general framework described herein, i.e., classifying speech as inactive or active, further classifying active speech, and then coding the speech signal using encoder/decoder modes particularly suited to the speech falling within each classification.

[0174] Although the active speech classifications are based on degree of periodicity, the classification decision is preferably not based on some direct measurement of periodicity. Rather, the classification decision is based on various parameters calculated in step 302, e.g., signal-to-noise ratios in the upper and lower bands and the NACFs. The preferred classification may be described by the following pseudo-code:

[0175] if not(previousNACF<0.5 and currentNACF>0.6)

[0176] if (currentNACF<0.75 and ZCR>60) UNVOICED

[0177] else if (previousNACF<0.5 and currentNACF<0.55 and ZCR>50) UNVOICED

[0178] else if (currentNACF<0.4 and ZCR>40) UNVOICED

[0179] if (UNVOICED and currentSNR>28 dB and E_(L)>αE_(H)) TRANSIENT

[0180] if (previousNACF<0.5 and currentNACF<0.5 and E<5e4+N_(noise)) UNVOICED

[0181] if (VOICED and low-bandSNR>high-bandSNR and previousNACF<0.8 and 0.6<currentNACF<0.75) TRANSIENT

[0182] where $\alpha = \begin{cases} 1.0 & E > 5e5 + N_{noise} \\ 20.0 & E \leq 5e5 + N_{noise} \end{cases}$

[0183] and N_(noise) is an estimate of the background noise. E_(prev) is the previous frame's input energy.
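
A direct Python rendering of the pseudo-code above; the VOICED starting label is an assumption of the sketch (the pseudo-code itself never assigns VOICED explicitly), and all thresholds are the exemplary values listed:

    def classify_active(prev_nacf, curr_nacf, zcr, curr_snr_db,
                        EL, EH, E, N_noise, low_band_snr, high_band_snr):
        """Classify an active frame as VOICED, UNVOICED, or TRANSIENT."""
        label = "VOICED"   # assumed default; demoted by the tests below
        if not (prev_nacf < 0.5 and curr_nacf > 0.6):
            if curr_nacf < 0.75 and zcr > 60:
                label = "UNVOICED"
            elif prev_nacf < 0.5 and curr_nacf < 0.55 and zcr > 50:
                label = "UNVOICED"
            elif curr_nacf < 0.4 and zcr > 40:
                label = "UNVOICED"
        alpha = 1.0 if E > 5e5 + N_noise else 20.0
        if label == "UNVOICED" and curr_snr_db > 28 and EL > alpha * EH:
            label = "TRANSIENT"
        if prev_nacf < 0.5 and curr_nacf < 0.5 and E < 5e4 + N_noise:
            label = "UNVOICED"
        if (label == "VOICED" and low_band_snr > high_band_snr
                and prev_nacf < 0.8 and 0.6 < curr_nacf < 0.75):
            label = "TRANSIENT"
        return label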

[0184] The method described by this pseudo-code can be refined according to the specific environment in which it is implemented. Those skilled in the art will recognize that the various thresholds given above are merely exemplary, and could require adjustment in practice depending upon the implementation. The method may also be refined by adding additional classification categories, such as dividing TRANSIENT into two categories: one for signals transitioning from high to low energy, and the other for signals transitioning from low to high energy.

[0185] Those skilled in the art will recognize that other methods are available for distinguishing voiced, unvoiced, and transient active speech. Similarly, skilled artisans will recognize that other classification schemes for active speech are also possible.

[0186] VI. Encoder/Decoder Mode Selection

[0187] In step 310, an encoder/decoder mode is selected based on the classification of the current frame in steps 304 and 308. According to a preferred embodiment, modes are selected as follows: inactive frames and active unvoiced frames are coded using a NELP mode, active voiced frames are coded using a PPP mode, and active transient frames are coded using a CELP mode. Each of these encoder/decoder modes is described in detail in the following sections.

[0188] In an alternative embodiment, inactive frames are coded using a zero rate mode. Skilled artisans will recognize that many alternative zero rate modes are available which require very low bit rates. The selection of a zero rate mode may be further refined by considering past mode selections. For example, if the previous frame was classified as active, this may preclude the selection of a zero rate mode for the current frame. Similarly, if the next frame is active, a zero rate mode may be precluded for the current frame. Another alternative is to preclude the selection of a zero rate mode for too many consecutive frames (e.g., 9 consecutive frames). Those skilled in the art will recognize that many other modifications might be made to the basic mode selection decision in order to refine its operation in certain environments.

[0189] As described above, many other combinations of classifications and encoder/decoder modes might be alternatively used within this same framework. The following sections provide detailed descriptions of several encoder/decoder modes according to the present invention. The CELP mode is described first, followed by the PPP mode and the NELP mode.

[0190] VII. Code Excited Linear Prediction (CELP) Coding Mode

[0191] As described above, the CELP encoder/decoder mode is employed when the current frame is classified as active transient speech. The CELP mode provides the most accurate signal reproduction (as compared to the other modes described herein) but at the highest bit rate.

[0192] FIG. 7 depicts a CELP encoder mode 204 and a CELP decoder mode 206 in further detail. As shown in FIG. 7A, CELP encoder mode 204 includes a pitch encoding module 702, an encoding codebook 704, and a filter update module 706. CELP encoder mode 204 outputs an encoded speech signal, s_(enc)(n), which preferably includes codebook parameters and pitch filter parameters, for transmission to CELP decoder mode 206. As shown in FIG. 7B, CELP decoder mode 206 includes a decoding codebook module 708, a pitch filter 710, and an LPC synthesis filter 712. CELP decoder mode 206 receives the encoded speech signal and outputs synthesized speech signal ŝ(n).

[0193] A. Pitch Encoding Module

[0194] Pitch encoding module 702 receives the speech signal s(n) and the quantized residual from the previous frame, p_(c)(n) (described below). Based on this input, pitch encoding module 702 generates a target signal x(n) and a set of pitch filter parameters. In a preferred embodiment, these pitch filter parameters include an optimal pitch lag L* and an optimal pitch gain b*. These parameters are selected according to an “analysis-by-synthesis” method in which the encoding process selects the pitch filter parameters that minimize the weighted error between the input speech and the synthesized speech using those parameters.

[0195] FIG. 8 depicts pitch encoding module 702 in greater detail. Pitch encoding module 702 includes a perceptual weighting filter 802, adders 804 and 816, weighted LPC synthesis filters 806 and 808, a delay and gain 810, and a minimize sum of squares 812.

[0196] Perceptual weighting filter 802 is used to weight the error between the original speech and the synthesized speech in a perceptually meaningful way. The perceptual weighting filter is of the form $W(z) = \frac{A(z)}{A(z/\gamma)}$

[0197] where A(z) is the LPC prediction error filter, and γ preferably equals 0.8. Weighted LPC analysis filter 806 receives the LPC coefficients calculated by initial parameter calculation module 202. Filter 806 outputs a_(zir)(n), which is the zero input response given the LPC coefficients. Adder 804 sums a negative input a_(zir)(n) and the filtered input signal to form target signal x(n).

[0198] Delay and gain 810 outputs an estimated pitch filter output bp_(L)(n) for a given pitch lag L and pitch gain b. Delay and gain 810 receives the quantized residual samples from the previous frame, p_(c)(n), and an estimate of future output of the pitch filter, given by p_(o)(n), and forms p(n) according to: $p(n) = \begin{cases} p_c(n) & -128 < n < 0 \\ p_o(n) & 0 \leq n < L_p \end{cases}$

[0199] which is then delayed by L samples and scaled by b to form bp_(L)(n). L_(p) is the subframe length (preferably 40 samples). In a preferred embodiment, the pitch lag, L, is represented by 8 bits and can take on values 20.0, 20.5, 21.0, 21.5, . . . 126.0, 126.5, 127.0, 127.5.

[0200] Weighted LPC analysis filter 808 filters bp_(L)(n) using the current LPC coefficients, resulting in by_(L)(n). Adder 816 sums a negative input by_(L)(n) with x(n), the output of which is received by minimize sum of squares 812. Minimize sum of squares 812 selects the optimal L, denoted by L*, and the optimal b, denoted by b*, as those values of L and b that minimize E_(pitch)(L) according to: $E_{pitch}(L) = \sum_{n=0}^{L_p - 1} \{ x(n) - b\,y_L(n) \}^2$ If $E_{xy}(L) \triangleq \sum_{n=0}^{L_p - 1} x(n)\,y_L(n)$ and $E_{yy}(L) \triangleq \sum_{n=0}^{L_p - 1} y_L(n)^2$,

[0201] then the value of b which minimizes E_(pitch)(L) for a given value of L is $b^* = \frac{E_{xy}(L)}{E_{yy}(L)}$, for which $E_{pitch}(L) = K - \frac{E_{xy}(L)^2}{E_{yy}(L)}$

[0202] where K is a constant that can be neglected.

[0203] The optimal values of L and b (L* and b*) are found by first determining the value of L which minimizes E_(pitch)(L) and then computing b*.
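
A sketch of this search, where y_of_L is assumed to be a caller-supplied function producing the weighted-synthesis-filtered pitch prediction y_L(n) for a candidate lag:

    import numpy as np

    def pitch_search(x, y_of_L, lags):
        """Pick L* and b* per the equations above: minimizing E_pitch(L) =
        K - Exy(L)^2/Eyy(L) is equivalent to maximizing Exy^2/Eyy, and
        b* = Exy(L*)/Eyy(L*)."""
        best = (None, 0.0, -np.inf)          # (L*, b*, Exy^2/Eyy)
        for L in lags:
            y = y_of_L(L)
            Exy = np.dot(x, y)
            Eyy = np.dot(y, y)
            if Eyy <= 0.0:
                continue
            score = Exy * Exy / Eyy
            if score > best[2]:
                best = (L, Exy / Eyy, score)
        return best[0], best[1]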

[0204] These pitch filter parameters are preferably calculated for each subframe and then quantized for efficient transmission. In a preferred embodiment, the transmission codes PLAGj and PGAINj for the j^(th) subframe are computed as $PGAINj = \left\lfloor \min\{b^*, 2\}\,\frac{8}{2} + 0.5 \right\rfloor - 1$ $PLAGj = \begin{cases} 0 & PGAINj = -1 \\ 2L^* & 0 \leq PGAINj < 8 \end{cases}$

[0205] PGAINj is then adjusted to −1 if PLAGj is set to 0. These transmission codes are transmitted to CELP decoder mode 206 as the pitch filter parameters, part of the encoded speech signal s_(enc)(n).

[0206] B. Encoding Codebook

[0207] Encoding codebook 704 receives the target signal x(n) and determines a set of codebook excitation parameters which are used by CELP decoder mode 206, along with the pitch filter parameters, to reconstruct the quantized residual signal.

[0208] Encoding codebook 704 first updates x(n) as follows.

x(n)=x(n)−y _(pzir)(n), 0≦n<40

[0209] where y_(pzir)(n) is the output of the weighted LPC synthesis filter (with memories retained from the end of the previous subframe) to an input which is the zero-input-response of the pitch filter with parameters $\hat{L}^*$ and $\hat{b}^*$ (and memories resulting from the previous subframe's processing).

[0210] A backfiltered target $\vec{d} = \{d_n\}$, 0≦n<40, is created as $\vec{d} = H^T\vec{x}$ where $H = \begin{bmatrix} h_0 & 0 & 0 & \ldots & 0 \\ h_1 & h_0 & 0 & \ldots & 0 \\ \ldots & \ldots & \ldots & \ldots & \ldots \\ h_{39} & h_{38} & h_{37} & \ldots & h_0 \end{bmatrix}$

[0211] is the impulse response matrix formed from the impulse response {h_(n)} and $\vec{x} = \{x(n)\}$, 0≦n<40. Two more vectors, $\vec{\varphi} = \{\varphi_n\}$ and $\vec{s}$, are created as well.

$\vec{s} = \mathrm{sign}(\vec{d})$

[0212] $\varphi_n = \begin{cases} 2\sum_{i=0}^{39-n} h_i h_{i+n} & 0 < n < 40 \\ \sum_{i=0}^{39} h_i^2 & n = 0 \end{cases}$ where $\mathrm{sign}(x) = \begin{cases} 1 & x \geq 0 \\ -1 & x < 0 \end{cases}$

[0213] Encoding codebook 704 initializes the values Exy* and Eyy* to zero and searches for the optimum excitation parameters, preferably with four values of N (0, 1, 2, 3), according to:

$\vec{p} = (N + \{0, 1, 2, 3, 4\})\,\%\,5$

$A = \{p_0, p_0+5, \ldots, i' < 40\} \quad B = \{p_1, p_1+5, \ldots, k' < 40\}$

$Den_{i,k} = 2\varphi_0 + s_i s_k \varphi_{|k-i|}, \quad i \in A,\ k \in B$

$\{I_0, I_1\} = \underset{i \in A,\, k \in B}{\arg\max}\left\{ \frac{|d_i| + |d_k|}{Den_{i,k}} \right\}$

$\{S_0, S_1\} = \{s_{I_0}, s_{I_1}\} \quad Exy0 = |d_{I_0}| + |d_{I_1}| \quad Eyy0 = Den_{I_0,I_1}$

$A = \{p_2, p_2+5, \ldots, i' < 40\} \quad B = \{p_3, p_3+5, \ldots, k' < 40\}$

$Den_{i,k} = Eyy0 + 2\varphi_0 + s_i\left(S_0\varphi_{|I_0-i|} + S_1\varphi_{|I_1-i|}\right) + s_k\left(S_0\varphi_{|I_0-k|} + S_1\varphi_{|I_1-k|}\right) + s_i s_k \varphi_{|k-i|}, \quad i \in A,\ k \in B$

$\{I_2, I_3\} = \underset{i \in A,\, k \in B}{\arg\max}\left\{ \frac{Exy0 + |d_i| + |d_k|}{Den_{i,k}} \right\}$

$\{S_2, S_3\} = \{s_{I_2}, s_{I_3}\} \quad Exy1 = Exy0 + |d_{I_2}| + |d_{I_3}| \quad Eyy1 = Den_{I_2,I_3}$

$A = \{p_4, p_4+5, \ldots, i' < 40\}$

$Den_i = Eyy1 + \varphi_0 + s_i\left(S_0\varphi_{|I_0-i|} + S_1\varphi_{|I_1-i|} + S_2\varphi_{|I_2-i|} + S_3\varphi_{|I_3-i|}\right), \quad i \in A$

$I_4 = \underset{i \in A}{\arg\max}\left\{ \frac{Exy1 + |d_i|}{Den_i} \right\}$

$S_4 = s_{I_4} \quad Exy2 = Exy1 + |d_{I_4}| \quad Eyy2 = Den_{I_4}$

If $Exy2^2\,Eyy^* > Exy^{*2}\,Eyy2$ {
$Exy^* = Exy2$, $Eyy^* = Eyy2$
$\{ind_{p0}, ind_{p1}, ind_{p2}, ind_{p3}, ind_{p4}\} = \{I_0, I_1, I_2, I_3, I_4\}$
$\{sgn_{p0}, sgn_{p1}, sgn_{p2}, sgn_{p3}, sgn_{p4}\} = \{S_0, S_1, S_2, S_3, S_4\}$
}

[0214] Encoding codebook 704 calculates the codebook gain $G^*$ as $\frac{Exy^*}{Eyy^*}$,

[0215] and then quantizes the set of excitation parameters as the following transmission codes for the j^(th) subframe: $CBIjk = \left\lfloor \frac{ind_k}{5} \right\rfloor, \quad 0 \leq k < 5$ $SIGNjk = \begin{cases} 0 & sgn_k = 1 \\ 1 & sgn_k = -1 \end{cases}, \quad 0 \leq k < 5$ $CBGj = \left\lfloor \min\{\log_2(\max\{1, G^*\}),\ 11.2636\}\,\frac{31}{11.2636} + 0.5 \right\rfloor$

[0216] and the quantized gain $\hat{G}^*$ is $2^{CBGj\frac{11.2636}{31}}$.

[0217] Lower bit rate embodiments of the CELP encoder/decoder mode may be realized by removing pitch encoding module 702 and only performing a codebook search to determine an index I and gain G for each of the four subframes. Those skilled in the art will recognize how the ideas described above might be extended to accomplish this lower bit rate embodiment.

[0218] C. CELP Decoder

[0219] CELP decoder mode 206 receives the encoded speech signal, preferably including codebook excitation parameters and pitch filter parameters, from CELP encoder mode 204, and based on this data outputs synthesized speech ŝ(n). Decoding codebook module 708 receives the codebook excitation parameters and generates the excitation signal cb(n) with a gain of G. The excitation signal cb(n) for the j^(th) subframe contains mostly zeroes except for the five locations:

I _(k)=5CBIjk+k, 0≦k<5

[0220] which correspondingly have impulses of value

S _(k)=1−2SIGNjk, 0≦k<5

[0221] all of which are scaled by the gain G which is computed to be$2^{{CBGj}\frac{11.2636}{31}},$

[0222] to provide Gcb(n).

[0223] Pitch filter 710 decodes the pitch filter parameters from the received transmission codes according to: $\hat{L}^* = \frac{PLAGj}{2}$ $\hat{b}^* = \begin{cases} 0 & \hat{L}^* = 0 \\ \frac{2}{8}PGAINj & \hat{L}^* \neq 0 \end{cases}$

[0224] Pitch filter 710 then filters Gcb(n), where the filter has a transfer function given by $\frac{1}{P(z)} = \frac{1}{1 - b^* z^{-L^*}}$
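
A sketch of these two decoding steps for one subframe; it assumes an integer lag for simplicity, whereas the codec above allows half-sample lags (which would require interpolation):

    import numpy as np

    def decode_excitation(CBI, SIGN, CBG, n=40):
        """Rebuild the five-impulse excitation Gcb(n) for one subframe from
        the transmission codes, per the equations above."""
        cb = np.zeros(n)
        for k in range(5):
            cb[5 * CBI[k] + k] = 1 - 2 * SIGN[k]   # impulse S_k at I_k
        G = 2.0 ** (CBG * 11.2636 / 31.0)
        return G * cb

    def pitch_filter(x, b, L, memory):
        """1/P(z) = 1/(1 - b z^-L): y(n) = x(n) + b y(n-L). `memory` holds
        at least the last L outputs of the previous subframe."""
        y = np.concatenate([np.asarray(memory, dtype=float), np.zeros(len(x))])
        m = len(memory)
        for i in range(len(x)):
            y[m + i] = x[i] + b * y[m + i - L]
        return y[m:]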

[0225] In a preferred embodiment, CELP decoder mode 206 also adds an extra pitch filtering operation, a pitch prefilter (not shown), after pitch filter 710. The lag for the pitch prefilter is the same as that of pitch filter 710, whereas its gain is preferably half of the pitch gain, up to a maximum of 0.5.

[0226] LPC synthesis filter 712 receives the reconstructed quantized residual signal $\hat{r}(n)$ and outputs the synthesized speech signal ŝ(n).

[0227] D. Filter Update Module

[0228] Filter update module 706 synthesizes speech as described in the previous section in order to update filter memories. Filter update module 706 receives the codebook excitation parameters and the pitch filter parameters, generates an excitation signal cb(n), pitch filters Gcb(n), and then synthesizes ŝ(n). By performing this synthesis at the encoder, memories in the pitch filter and in the LPC synthesis filter are updated for use when processing the following subframe.

[0229] VIII. Prototype Pitch Period (PPP) Coding Mode

[0230] Prototype pitch period (PPP) coding exploits the periodicity of a speech signal to achieve lower bit rates than may be obtained using CELP coding. In general, PPP coding involves extracting a representative period of the residual signal, referred to herein as the prototype residual, and then using that prototype to construct earlier pitch periods in the frame by interpolating between the prototype residual of the current frame and a similar pitch period from the previous frame (i.e., the prototype residual if the last frame was PPP). The effectiveness (in terms of lowered bit rate) of PPP coding depends, in part, on how closely the current and previous prototype residuals resemble the intervening pitch periods. For this reason, PPP coding is preferably applied to speech signals that exhibit relatively high degrees of periodicity (e.g., voiced speech), referred to herein as quasi-periodic speech signals.

[0231] FIG. 9 depicts a PPP encoder mode 204 and a PPP decoder mode 206 in further detail. PPP encoder mode 204 includes an extraction module 904, a rotational correlator 906, an encoding codebook 908, and a filter update module 910. PPP encoder mode 204 receives the residual signal r(n) and outputs an encoded speech signal s_(enc)(n), which preferably includes codebook parameters and rotational parameters. PPP decoder mode 206 includes a codebook decoder 912, a rotator 914, an adder 916, a period interpolator 920, and a warping filter 918.

[0232] FIG. 10 is a flowchart 1000 depicting the steps of PPP coding, including encoding and decoding. These steps are discussed along with the various components of PPP encoder mode 204 and PPP decoder mode 206.

[0233] A. Extraction Module

[0234] In step 1002, extraction module 904 extracts a prototype residual r_(p)(n) from the residual signal r(n). As described above in Section III.F., initial parameter calculation module 202 employs an LPC analysis filter to compute r(n) for each frame. In a preferred embodiment, the LPC coefficients in this filter are perceptually weighted as described in Section VII.A. The length of r_(p)(n) is equal to the pitch lag L computed by initial parameter calculation module 202 during the last subframe in the current frame.

[0235] FIG. 11 is a flowchart depicting step 1002 in greater detail. PPP extraction module 904 preferably selects a pitch period as close to the end of the frame as possible, subject to certain restrictions discussed below. FIG. 12 depicts an example of a residual signal calculated based on quasi-periodic speech, including the current frame and the last subframe from the previous frame.

[0236] In step 1102, a “cut-free region” is determined. The cut-free region defines a set of samples in the residual which cannot be endpoints of the prototype residual. The cut-free region ensures that high energy regions of the residual do not occur at the beginning or end of the prototype (which could cause discontinuities in the output were it allowed to happen). The absolute value of each of the final L samples of r(n) is calculated. The variable P_(S) is set equal to the time index of the sample with the largest absolute value, referred to herein as the “pitch spike.” For example, if the pitch spike occurred in the last sample of the final L samples, P_(S)=L−1. In a preferred embodiment, the minimum sample of the cut-free region, CF_(min), is set to be P_(S)−6 or P_(S)−0.25L, whichever is smaller. The maximum of the cut-free region, CF_(max), is set to be P_(S)+6 or P_(S)+0.25L, whichever is larger.

[0237] In step 1104, the prototype residual is selected by cutting L samples from the residual. The region chosen is as close as possible to the end of the frame, under the constraint that the endpoints of the region cannot be within the cut-free region. The L samples of the prototype residual are determined using the algorithm described in the following pseudo-code:

[0238] if(CF_(min)<0) {

[0239] for(i=0 to L+CF_(min)−1)r_(p)(i)=r(i+160−L)

[0240] for(i=L+CF_(min) to L−1)r_(p)(i)=r(i+160−2L)

[0241] }

[0242] else if(CF_(max)≧L) {

[0243] for(i=0 to CF_(min)−1)r_(p)(i)=r(i+160−L)

[0244] for(i=CF_(min) to L−1)r_(p)(i)=r(i+160−2L)

[0245] }

[0246] else {

[0247] for(i=0 to L−1)r_(p)(i)=r(i+160−L)

[0248] }
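An equivalent Python rendering of the extraction logic above, with the cut-free computation of step 1102 folded in. This is a sketch: 0.25L is truncated to an integer here, which the specification leaves unspecified, and the 160-sample frame length is taken from the description.

import numpy as np

def extract_prototype(r, L, frame_len=160):
    # Step 1102: locate the pitch spike in the final L samples and form the
    # cut-free region; step 1104: cut L contiguous samples whose endpoints
    # avoid it, as close to the end of the frame as possible.
    tail = np.abs(r[frame_len - L:frame_len])
    Ps = int(np.argmax(tail))
    CFmin = min(Ps - 6, Ps - int(0.25 * L))
    CFmax = max(Ps + 6, Ps + int(0.25 * L))
    rp = np.empty(L)
    if CFmin < 0:
        rp[:L + CFmin] = r[frame_len - L:frame_len + CFmin]
        rp[L + CFmin:] = r[frame_len - L + CFmin:frame_len - L]
    elif CFmax >= L:
        rp[:CFmin] = r[frame_len - L:frame_len - L + CFmin]
        rp[CFmin:] = r[frame_len - 2 * L + CFmin:frame_len - L]
    else:
        rp[:] = r[frame_len - L:frame_len]
    return rp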

[0249] B. Rotational Correlator

[0250] Referring back to FIG. 10, in step 1004, rotational correlator 906 calculates a set of rotational parameters based on the current prototype residual, r_(p)(n), and the prototype residual from the previous frame, r_(prev)(n). These parameters describe how r_(prev)(n) can best be rotated and scaled for use as a predictor of r_(p)(n). In a preferred embodiment, the set of rotational parameters includes an optimal rotation R* and an optimal gain b*. FIG. 13 is a flowchart depicting step 1004 in greater detail.

[0251] In step 1302, the perceptually weighted target signal x(n) is computed by circularly filtering the prototype pitch residual period r_(p)(n). This is achieved as follows. A temporary signal tmp1(n) is created from r_(p)(n) as

$tmp1(n) = \begin{cases} r_p(n), & 0 \leq n < L \\ 0, & L \leq n < 2L \end{cases}$

[0252] which is filtered by the weighted LPC synthesis filter with zero memories to provide an output tmp2(n). In a preferred embodiment, the LPC coefficients used are the perceptually weighted coefficients corresponding to the last subframe in the current frame. The target signal x(n) is then given by

x(n)=tmp2(n)+tmp2(n+L), 0≦n<L

[0253] In step 1304, the prototype residual from the previous frame, r_(prev)(n), is extracted from the previous frame's quantized formant residual (which is also in the pitch filter's memories). The previous prototype residual is preferably defined as the last L_(p) values of the previous frame's formant residual, where L_(p) is equal to L if the previous frame was not a PPP frame, and is set to the previous pitch lag otherwise.

[0254] In step 1306, the length of r_(prev)(n) is altered to be of the same length as x(n) so that correlations can be correctly computed. This technique for altering the length of a sampled signal is referred to herein as warping. The warped pitch excitation signal, rw_(prev)(n), may be described as

rw _(prev)(n)=r _(prev)(n*TWF), 0≦n<L

[0255] where TWF is the time warping factor $\frac{L_{p}}{L}$

[0256] The sample values at non-integral points n·TWF are preferably computed using a set of sinc function tables. The sinc sequence chosen is sinc(−3−F : 4−F) where F is the fractional part of n·TWF rounded to the nearest multiple of 1/8.

[0257] The beginning of this sequence is aligned with r_(prev)((N−3)%L_(p)) where N is the integral part of n·TWF after being rounded to the nearest eighth.
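A Python sketch of the warping operation, evaluating the 8-tap sinc kernel directly rather than from the precomputed tables the preferred embodiment would use:

import numpy as np

def warp(r_prev, L):
    # Resample the Lp-sample previous prototype to L samples. Each output
    # point n*TWF is rounded to the nearest eighth of a sample and
    # reconstructed with the kernel sinc(-3-F : 4-F), whose first tap is
    # aligned with r_prev((N-3) % Lp).
    Lp = len(r_prev)
    TWF = Lp / L
    out = np.zeros(L)
    for n in range(L):
        t = round(n * TWF * 8) / 8.0
        N, F = int(t), t - int(t)
        for j in range(-3, 5):
            out[n] += r_prev[(N + j) % Lp] * np.sinc(j - F)
    return out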

[0258] In step 1308, the warped pitch excitation signal rw_(prev)(n) is circularly filtered, resulting in y(n). This operation is the same as that described above with respect to step 1302, but applied to rw_(prev)(n).

[0259] In step 1310, the pitch rotation search range is computed by first calculating an expected rotation E_(rot),

$E_{rot} = L - round\left( L \cdot frac\left( \frac{(160 - L)(L_p + L)}{2 L_p L} \right) \right)$

[0260] where frac(x) gives the fractional part of x. If L<80, the pitch rotation search range is defined to be {E_(rot)−8, E_(rot)−7.5, . . . , E_(rot)+7.5}; if L≧80, it is {E_(rot)−16, E_(rot)−15, . . . , E_(rot)+15}.

[0261] In step 1312, the rotational parameters, optimal rotation R* and optimal gain b*, are calculated. The pitch rotation which results in the best prediction between x(n) and y(n) is chosen, along with the corresponding gain b. These parameters are preferably chosen to minimize the error signal e(n)=x(n)−y(n). The optimal rotation R* and the optimal gain b* are those values of rotation R and gain b which result in the maximum value of

$\frac{Exy_R^2}{Eyy}, \quad where \quad Exy_R = \sum_{i=0}^{L-1} x((i+R)\%L)\,y(i) \quad and \quad Eyy = \sum_{i=0}^{L-1} y(i)\,y(i)$

[0262] for which the optimal gain b* is $\frac{Exy_{R^*}}{Eyy}$

[0263] at rotation R*. For fractional values of rotation, the value of Exy_(R) is approximated by interpolating the values of Exy_(R) computed at integer values of rotation. A simple four tap interpolation filter is used. For example,

Exy _(R)=0.54(Exy _(R′)+Exy _(R′+1))−0.04(Exy _(R′−1)+Exy _(R′+2))

[0264] where R is a non-integral rotation (with precision of 0.5) and R′=└R┘.
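A sketch of the rotation search for the L<80 case, using the four-tap interpolation above for half-sample rotations. It assumes E_rot is an integer, as produced by the formula in step 1310; x and y are the target and filtered warped prototype.

def best_rotation(x, y, E_rot, L):
    # Exhaustive half-sample search over {E_rot-8 .. E_rot+7.5}; the L >= 80
    # case would instead use integer steps over {E_rot-16 .. E_rot+15}.
    Eyy = sum(v * v for v in y)
    Exy = {}
    for R in range(E_rot - 9, E_rot + 10):
        Exy[R] = sum(x[(i + R) % L] * y[i] for i in range(L))
    def exy(R):
        if R == int(R):
            return Exy[int(R)]
        Rp = int(R // 1)   # floor of the half-sample rotation
        return 0.54 * (Exy[Rp] + Exy[Rp + 1]) - 0.04 * (Exy[Rp - 1] + Exy[Rp + 2])
    best = max((E_rot - 8 + 0.5 * m for m in range(32)),
               key=lambda R: exy(R) ** 2 / Eyy)
    return best, exy(best) / Eyy   # R*, b*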

[0265] In a preferred embodiment, the rotational parameters are quantized for efficient transmission. The optimal gain b* is preferably quantized uniformly between 0.0625 and 4.0 as

$PGAIN = \max\left\{ \min\left( \lfloor 63\left( \frac{b^* - 0.0625}{4 - 0.0625} \right) + 0.5 \rfloor, 63 \right), 0 \right\}$

[0266] where PGAIN is the transmission code and the quantized gain b̂* is given by

$\max\left\{ 0.0625 + \frac{PGAIN(4 - 0.0625)}{63}, \; 0.0625 \right\}$

[0267] The optimal rotation R* is quantized as the transmission code PROT, which is set to 2(R*−E_(rot)+8) if L<80, and to R*−E_(rot)+16 if L≧80.
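The quantization of both rotational parameters fits in a few lines; a sketch with illustrative names:

def quantize_rotation_params(b_star, R_star, E_rot, L):
    # 6-bit uniform gain code over [0.0625, 4.0], plus the rotation offset code.
    PGAIN = max(min(int(63 * (b_star - 0.0625) / (4 - 0.0625) + 0.5), 63), 0)
    PROT = int(2 * (R_star - E_rot + 8)) if L < 80 else int(R_star - E_rot + 16)
    b_hat = max(0.0625 + PGAIN * (4 - 0.0625) / 63, 0.0625)
    return PGAIN, PROT, b_hat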

[0268] C. Encoding Codebook

[0269] Referring back to FIG. 10, in step 1006, encoding codebook 908 generates a set of codebook parameters based on the received target signal x(n). Encoding codebook 908 seeks to find one or more codevectors which, when scaled, added, and filtered, sum to a signal which approximates x(n). In a preferred embodiment, encoding codebook 908 is implemented as a multi-stage codebook, preferably three stages, where each stage produces a scaled codevector. The set of codebook parameters therefore includes the indexes and gains corresponding to three codevectors. FIG. 14 is a flowchart depicting step 1006 in greater detail.

[0270] In step 1402, before the codebook search is performed, the target signal x(n) is updated as

x(n)=x(n)−by((n−R*)%L), 0≦n<L

[0271] If in the above subtraction the rotation R* is non-integral (i.e., has a fraction of 0.5), then

y(i−0.5)=−0.0073(y(i−4)+y(i+3))+0.0322(y(i−3)+y(i+2))−0.1363(y(i−2)+y(i+1))+0.6076(y(i−1)+y(i))

[0272] where i=n−└R*┘.

[0273] In step 1404, the codebook values are partitioned into multiple regions. According to a preferred embodiment, the codebook is determined as

$c(n) = \begin{cases} 1, & n = 0 \\ 0, & 0 < n < L \\ CBP(n - L), & L \leq n < 128 + L \end{cases}$

[0274] where CBP are the values of a stochastic or trained codebook. Those skilled in the art will recognize how these codebook values are generated. The codebook is partitioned into multiple regions, each of length L. The first region is a single pulse, and the remaining regions are made up of values from the stochastic or trained codebook. The number of regions N will be ┌128/L┐.
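A sketch of the codebook construction and partitioning, assuming CBP is available as an array of at least 128 stochastic or trained values; only the values reachable by the search (shifts I < 128) are retained:

import numpy as np

def build_ppp_codebook(CBP, L):
    # First region: a single unit pulse; remaining regions: CBP values.
    N = -(-128 // L)               # ceil(128 / L) regions of length L
    c = np.zeros(N * L)
    c[0] = 1.0
    m = min(128, N * L - L)        # stochastic samples that fit after the pulse
    c[L:L + m] = CBP[:m]
    return c.reshape(N, L)         # row `reg` is region reg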

[0275] In step 1406, the multiple regions of the codebook are each circularly filtered to produce the filtered codebooks, y_(reg)(n), the concatenation of which is the signal y(n). For each region, the circular filtering is performed as described above with respect to step 1302.

[0276] In step 1408, the filtered codebook energy, Eyy(reg), is computed for each region and stored:

$Eyy(reg) = \sum_{i=0}^{L-1} y_{reg}^2(i), \quad 0 \leq reg < N$

[0277] In step 1410, the codebook parameters (i.e., codevector index and gain) for each stage of the multi-stage codebook are computed. According to a preferred embodiment, let Region(I)=reg, defined as the region in which sample I resides, or

$Region(I) = \begin{cases} 0, & 0 \leq I < L \\ 1, & L \leq I < 2L \\ 2, & 2L \leq I < 3L \\ \cdots & \cdots \end{cases}$

[0278] and let Exy(I) be defined as

$Exy(I) = \sum_{i=0}^{L-1} x(i)\, y_{Region(I)}((i+I)\%L)$

[0279] The codebook parameters, I* and G*, for the j^(th) codebook stage are computed using the following pseudo-code:

Exy* = 0, Eyy* = 0
for(I=0 to 127){
  compute Exy(I)
  if(Exy(I)√Eyy* > Exy*√Eyy(Region(I))) {
    Exy* = Exy(I)
    Eyy* = Eyy(Region(I))
    I* = I
  }
}

and $G^* = \frac{Exy^*}{Eyy^*}$.
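The same stage search in runnable Python (a sketch: Eyy* starts at 1 here so that the first positively correlated shift is accepted, whereas the listing above initializes both accumulators to zero):

import math

def search_stage(x, y_regions, Eyy):
    # One stage of the multi-stage search. y_regions[reg] holds the
    # circularly filtered region codevector; Eyy[reg] its stored energy.
    L = len(x)
    exy_best, eyy_best, I_best = 0.0, 1.0, 0
    for I in range(128):
        reg = I // L
        y = y_regions[reg]
        exy_I = sum(x[i] * y[(i + I) % L] for i in range(L))
        if exy_I * math.sqrt(eyy_best) > exy_best * math.sqrt(Eyy[reg]):
            exy_best, eyy_best, I_best = exy_I, Eyy[reg], I
    return I_best, exy_best / eyy_best   # I*, G*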

[0280] According to a preferred embodiment, the codebook parameters are quantized for efficient transmission. The transmission code CBIj (j = stage number: 0, 1, or 2) is preferably set to I*, and the transmission codes CBGj and SIGNj are set by quantizing the gain G*:

$SIGNj = \begin{cases} 0, & G^* \geq 0 \\ 1, & G^* < 0 \end{cases} \quad CBGj = \lfloor \min\{\max\{0, \log_2(|G^*|)\}, 11.25\} \cdot \frac{4}{3} + 0.5 \rfloor$

[0281] and the quantized gain Ĝ* is

$\hat{G}^* = \begin{cases} 2^{0.75\,CBGj}, & SIGNj = 0 \\ -2^{0.75\,CBGj}, & SIGNj \neq 0 \end{cases}$

[0282] The target signal x(n) is then updated by subtracting the contribution of the codebook vector of the current stage:

x(n)=x(n)−Ĝ*y_(Region(I*))((n+I*)%L), 0≦n<L

[0283] The above procedures starting from the pseudo-code are repeated to compute I*, G*, and the corresponding transmission codes, for the second and third stages.

[0284] D. Filter Update Module

[0285] Referring back to FIG. 10, in step 1008, filter update module 910 updates the filters used by PPP encoder mode 204. Two alternative embodiments are presented for filter update module 910, as shown in FIGS. 15A and 16A. As shown in the first alternative embodiment in FIG. 15A, filter update module 910 includes a decoding codebook 1502, a rotator 1504, a warping filter 1506, an adder 1510, an alignment and interpolation module 1508, an update pitch filter module 1512, and an LPC synthesis filter 1514. The second embodiment, as shown in FIG. 16A, includes a decoding codebook 1602, a rotator 1604, a warping filter 1606, an adder 1608, an update pitch filter module 1610, a circular LPC synthesis filter 1612, and an update LPC filter module 1614. FIGS. 17 and 18 are flowcharts depicting step 1008 in greater detail, according to the two embodiments.

[0286] In step 1702 (and 1802, the first step of both embodiments), the current reconstructed prototype residual, r_(curr)(n), L samples in length, is reconstructed from the codebook parameters and rotational parameters. In a preferred embodiment, rotator 1504 (and 1604) rotates a warped version of the previous prototype residual according to the following:

r_(curr)((n+R*)%L)=b·rw_(prev)(n), 0≦n<L

[0287] where r_(curr) is the current prototype to be created, rw_(prev) is the warped (as described above in Section VIII.A., with $TWF = \frac{L_p}{L}$)

[0288] version of the previous period obtained from the most recent L samples of the pitch filter memories, b the pitch gain and R the rotation, obtained from packet transmission codes as

$b = \max\left\{ 0.0625 + \frac{PGAIN(4 - 0.0625)}{63}, \; 0.0625 \right\}$

$R = \begin{cases} \frac{PROT}{2} + E_{rot} - 8, & L < 80 \\ PROT + E_{rot} - 16, & L \geq 80 \end{cases}$

[0289] where E_(rot) is the expected rotation computed as described above in Section VIII.B.

[0290] Decoding codebook 1502 (and 1602) adds the contributions for each of the three codebook stages to r_(curr)(n) as

$r_{curr}((n+I)\%L) = r_{curr}((n+I)\%L) + \begin{cases} G, & I < L, \; n = 0 \\ G \cdot CBP(I - L + n), & I \geq L, \; 0 \leq n < L \end{cases}$

[0291] where I=CBIj and G is obtained from CBGj and SIGNj as described in the previous section, j being the stage number.

[0292] At this point, the two alternative embodiments for filter update module 910 differ. Referring first to the embodiment of FIG. 15A, in step 1704, alignment and interpolation module 1508 fills in the remainder of the residual samples from the beginning of the current frame to the beginning of the current prototype residual (as shown in FIG. 12). Here, the alignment and interpolation are performed on the residual signal. However, these same operations can also be performed on speech signals, as described below. FIG. 19 is a flowchart describing step 1704 in further detail.

[0293] In step 1902, it is determined whether the previous lag L_(p) is a double or a half relative to the current lag L. In a preferred embodiment, other multiples are considered too improbable, and are therefore not considered. If L_(p)>1.85L, L_(p) is halved and only the first half of the previous period r_(prev)(n) is used. If L_(p)<0.54L, the current lag L is likely a double and consequently L_(p) is also doubled and the previous period r_(prev)(n) is extended by repetition.

[0294] In step 1904, r_(prev)(n) is warped to form rw_(prev)(n) as described above with respect to step 1306, with $TWF = \frac{L_p}{L}$,

[0295] so that the lengths of both prototype residuals are now the same. Note that this operation was performed in step 1702, as described above, by warping filter 1506. Those skilled in the art will recognize that step 1904 would be unnecessary if the output of warping filter 1506 were made available to alignment and interpolation module 1508.

[0296] In step 1906, the allowable range of alignment rotations is computed. The expected alignment rotation, E_(A), is computed to be the same as E_(rot) as described above in Section VIII.B. The alignment rotation search range is defined to be {E_(A)−δA, E_(A)−δA+0.5, E_(A)−δA+1, . . . , E_(A)+δA−1.5, E_(A)+δA−1}, where δA=max{6, 0.15L}.

[0297] In step 1908, the cross-correlations between the previous and current prototype periods for integer alignment rotations A are computed as

$C(A) = \sum_{i=0}^{L-1} r_{curr}((i+A)\%L)\, rw_{prev}(i)$

[0298] and the cross-correlations for non-integral rotations A are approximated by interpolating the values of the correlations at integral rotation:

C(A)=0.54(C(A′)+C(A′+1))−0.04(C(A′−1)+C(A′+2))

[0299] where A′=A−0.5.

[0300] In step 1910, the value of A (over the range of allowable rotations) which results in the maximum value of C(A) is chosen as the optimal alignment, A*.

[0301] In step 1912, the average lag or pitch period for the intermediate samples, L_(av), is computed in the following manner. A period number estimate, N_(per), is computed as

$N_{per} = round\left( \frac{A^*}{L} + \frac{(160 - L)(L_p + L)}{2 L_p L} \right)$

[0302] with the average lag for the intermediate samples given by

$L_{av} = \frac{(160 - L) L}{N_{per} L - A^*}$

[0303] In step 1914, the remaining residual samples in the current frame are calculated according to the following interpolation between the previous and current prototype residuals:

$\hat{r}(n) = \begin{cases} \left(1 - \frac{n}{160 - L}\right) rw_{prev}((n\alpha)\%L) + \frac{n}{160 - L}\, r_{curr}((n\alpha + A^*)\%L), & 0 \leq n < 160 - L \\ r_{curr}(n + L - 160), & 160 - L \leq n < 160 \end{cases}$

[0304] where $\alpha = \frac{L}{L_{av}}$.
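A sketch of the interpolation of step 1914, rounding the non-integral indices rather than sinc-interpolating them as the preferred embodiment does (see the following paragraphs):

import numpy as np

def interpolate_periods(rw_prev, r_curr, A_star, L_av, L, frame_len=160):
    # Crossfade rotated copies of the previous and current prototypes over
    # the first 160-L samples; the final L samples are the prototype itself.
    alpha = L / L_av
    M = frame_len - L
    r_hat = np.zeros(frame_len)
    for n in range(M):
        w = n / M
        i_prev = int(round(n * alpha)) % L
        i_curr = int(round(n * alpha + A_star)) % L
        r_hat[n] = (1 - w) * rw_prev[i_prev] + w * r_curr[i_curr]
    r_hat[M:] = r_curr[:L]
    return r_hat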

[0305] The sample values at non-integral points ñ (equal to either nα or nα+A*) are computed using a set of sinc function tables. The sinc sequence chosen is sinc(−3−F : 4−F) where F is the fractional part of ñ rounded to the nearest multiple of 1/8.

[0306] The beginning of this sequence is aligned with r_(prev)((N−3)%L_(p)) where N is the integral part of ñ after being rounded to the nearest eighth.

[0307] Note that this operation is essentially the same as warping, as described above with respect to step 1306. Therefore, in an alternative embodiment, the interpolation of step 1914 is computed using a warping filter. Those skilled in the art will recognize that economies might be realized by reusing a single warping filter for the various purposes described herein.

[0308] Returning to FIG. 17, in step 1706, update pitch filter module 1512 copies values from the reconstructed residual r̂(n) to the pitch filter memories. Likewise, the memories of the pitch prefilter are also updated.

[0309] In step 1708, LPC synthesis filter 1514 filters the reconstructed residual r̂(n), which has the effect of updating the memories of the LPC synthesis filter.

[0310] The second embodiment of filter update module 910, as shown in FIG. 16A, is now described. As described above with respect to step 1702, in step 1802, the prototype residual is reconstructed from the codebook and rotational parameters, resulting in r_(curr)(n).

[0311] In step 1804, update pitch filter module 1610 updates the pitch filter memories by copying replicas of the L samples from r_(curr)(n), according to

pitch_mem(i)=r _(curr)((L−(131%L)+i)%L), 0≦i<131

[0312] or alternatively,

pitch_mem(131−1−i)=r _(curr)((L−1−i)%L), 0≦i<131

[0313] where 131 is preferably the pitch filter order for a maximum lag of 127.5. In a preferred embodiment, the memories of the pitch prefilter are identically replaced by replicas of the current period r_(curr)(n):

pitch_prefilt_mem(i)=pitch_mem(i),0≦i<131
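A sketch of the memory update of step 1804, following the first of the two equivalent forms above:

def update_pitch_memories(r_curr, order=131):
    # Fill the pitch filter memory (order 131, for a maximum lag of 127.5)
    # with wrapped replicas of the L-sample prototype; the pitch prefilter
    # memory is an identical copy.
    L = len(r_curr)
    pitch_mem = [r_curr[(L - (order % L) + i) % L] for i in range(order)]
    return pitch_mem, list(pitch_mem)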

[0314] In step 1806, r_(curr)(n) is circularly filtered as described in Section VIII.B., resulting in s_(c)(n), preferably using perceptually weighted LPC coefficients.

[0315] In step 1808, values from s_(c)(n), preferably the last ten values (for a 10^(th) order LPC filter), are used to update the memories of the LPC synthesis filter.

[0316] E. PPP Decoder

[0317] Returning to FIGS. 9 and 10, in step 1010, PPP decoder mode 206 reconstructs the prototype residual r_(curr)(n) based on the received codebook and rotational parameters. Decoding codebook 912, rotator 914, and warping filter 918 operate in the manner described in the previous section. Period interpolator 920 receives the reconstructed prototype residual r_(curr)(n) and the previous reconstructed prototype residual r_(prev)(n), interpolates the samples between the two prototypes, and outputs the synthesized speech signal ŝ(n). Period interpolator 920 is described in the following section.

[0318] F. Period Interpolator

[0319] In step 1012, period interpolator 920 receives r_(curr)(n) and outputs synthesized speech signal ŝ(n). Two alternative embodiments for period interpolator 920 are presented herein, as shown in FIGS. 15B and 16B. In the first alternative embodiment, FIG. 15B, period interpolator 920 includes an alignment and interpolation module 1516, an LPC synthesis filter 1518, and an update pitch filter module 1520. The second alternative embodiment, as shown in FIG. 16B, includes a circular LPC synthesis filter 1616, an alignment and interpolation module 1618, an update pitch filter module 1622, and an update LPC filter module 1620. FIGS. 20 and 21 are flowcharts depicting step 1012 in greater detail, according to the two embodiments.

[0320] Referring to FIG. 15B, in step 2002, alignment and interpolation module 1516 reconstructs the residual signal for the samples between the current residual prototype r_(curr)(n) and the previous residual prototype r_(prev)(n), forming r̂(n). Alignment and interpolation module 1516 operates in the manner described above with respect to step 1704 (as shown in FIG. 19).

[0321] In step 2004, update pitch filter module 1520 updates the pitch filter memories based on the reconstructed residual signal r̂(n), as described above with respect to step 1706.

[0322] In step 2006, LPC synthesis filter 1518 synthesizes the output speech signal ŝ(n) based on the reconstructed residual signal r̂(n). The LPC filter memories are automatically updated when this operation is performed.

[0323] Referring now to FIGS. 16B and 21, in step 2102, update pitch filter module 1622 updates the pitch filter memories based on the reconstructed current residual prototype, r_(curr)(n), as described above with respect to step 1804.

[0324] In step 2104, circular LPC synthesis filter 1616 receives r_(curr)(n) and synthesizes a current speech prototype, s_(c)(n) (which is L samples in length), as described above in Section VIII.B.

[0325] In step 2106, update LPC filter module 1620 updates the LPC filter memories as described above with respect to step 1808.

[0326] In step 2108, alignment and interpolation module 1618 reconstructs the speech samples between the previous prototype period and the current prototype period. The previous prototype residual, r_(prev)(n), is circularly filtered (in an LPC synthesis configuration) so that the interpolation may proceed in the speech domain. Alignment and interpolation module 1618 operates in the manner described above with respect to step 1704 (see FIG. 19), except that the operations are performed on speech prototypes rather than residual prototypes. The result of the alignment and interpolation is the synthesized speech signal ŝ(n).

[0327] IX. Noise Excited Linear Prediction (NELP) Coding Mode

[0328] Noise Excited Linear Prediction (NELP) coding models the speech signal as a pseudo-random noise sequence and thereby achieves lower bit rates than may be obtained using either CELP or PPP coding. NELP coding operates most effectively, in terms of signal reproduction, where the speech signal has little or no pitch structure, such as unvoiced speech or background noise.

[0329] FIG. 22 depicts a NELP encoder mode 204 and a NELP decoder mode 206 in further detail. NELP encoder mode 204 includes an energy estimator 2202 and an encoding codebook 2204. NELP decoder mode 206 includes a decoding codebook 2206, a random number generator 2210, a multiplier 2212, and an LPC synthesis filter 2208.

[0330] FIG. 23 is a flowchart 2300 depicting the steps of NELP coding, including encoding and decoding. These steps are discussed along with the various components of NELP encoder mode 204 and NELP decoder mode 206.

[0331] In step 2302, energy estimator 2202 calculates the energy of the residual signal for each of the four subframes as

$Esf_i = 0.5 \log_2\left( \frac{\sum_{n=40i}^{40i+39} s^2(n)}{40} \right), \quad 0 \leq i < 4$

[0332] In step 2304, encoding codebook 2204 calculates a set of codebook parameters, forming encoded speech signal s_(enc)(n). In a preferred embodiment, the set of codebook parameters includes a single parameter, index I0. Index I0 is set equal to the value of j which minimizes

$\sum_{i=0}^{3} \left( Esf_i - SFEQ(j,i) \right)^2, \quad where \quad 0 \leq j < 128$

[0333] The codebook vectors, SFEQ, are used to quantize the subframe energies Esf_(i) and include a number of elements equal to the number of subframes within a frame (i.e., 4 in a preferred embodiment). These codebook vectors are preferably created according to standard techniques known to those skilled in the art for creating stochastic or trained codebooks.
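A sketch of the complete NELP encoding path of steps 2302-2304, assuming SFEQ is available as a 128x4 array of codebook vectors and the residual is one 160-sample frame:

import numpy as np

def nelp_encode(residual, SFEQ):
    # Compute the four 0.5*log2 subframe energies, then pick the codebook
    # row that minimizes the squared error against them.
    Esf = np.array([0.5 * np.log2(np.mean(residual[40 * i:40 * i + 40] ** 2))
                    for i in range(4)])
    I0 = int(np.argmin(np.sum((Esf - SFEQ) ** 2, axis=1)))
    return I0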

[0334] In step 2306, decoding codebook 2206 decodes the received codebook parameters. In a preferred embodiment, the set of subframe gains G_(i) is decoded according to:

G _(i)=2^(SFEQ(I0,i)),

[0335] or

G _(i)=2^(0.2SFEQ(I0,i)+0.8log₂Gprev) (where the previous frame was coded using a zero-rate coding scheme)

[0336] where 0≦i<4 and Gprev is the codebook excitation gain corresponding to the last subframe of the previous frame.

[0337] In step 2308, random number generator 2210 generates a unit variance random vector nz(n). This random vector is scaled by the appropriate gain G_(i) within each subframe in step 2310, creating the excitation signal G_(i)nz(n).

[0338] In step 2312, LPC synthesis filter 2208 filters the excitation signal G_(i)nz(n) to form the output speech signal, ŝ(n).
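A sketch of the NELP decoding path of steps 2306-2312 for the non-zero-rate case, assuming SFEQ as above and LPC coefficients a_1..a_p under the sign convention A(z) = 1 − Σ a_k z^(−k):

import numpy as np

def nelp_decode(I0, SFEQ, lpc, rng=None):
    # Decode gains G_i = 2^SFEQ(I0,i), scale unit-variance noise per
    # subframe, and run the all-pole 1/A(z) synthesis filter with zero
    # initial memory.
    rng = rng or np.random.default_rng()
    G = 2.0 ** np.asarray(SFEQ[I0], dtype=float)
    exc = rng.standard_normal(160)
    for i in range(4):
        exc[40 * i:40 * i + 40] *= G[i]
    s = np.zeros(160)
    p = len(lpc)
    for n in range(160):
        s[n] = exc[n] + sum(lpc[k] * s[n - 1 - k] for k in range(min(p, n)))
    return s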

[0339] In a preferred embodiment, a zero rate mode is also employed where the gain G_(i) and LPC parameters obtained from the most recent non-zero-rate NELP subframe are used for each subframe in the current frame. Those skilled in the art will recognize that this zero rate mode can effectively be used where multiple NELP frames occur in succession.

[0340] X. Conclusion

[0341] While various embodiments of the present invention have been described above, it should be understood that they have been presented by way of example only, and not limitation. Thus, the breadth and scope of the present invention should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.

[0342] The previous description of the preferred embodiments is provided to enable any person skilled in the art to make or use the present invention. While the invention has been particularly shown and described with reference to preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention.

What is claimed is:
1. A method for the variable rate coding of a speech signal, comprising the steps of: (a) classifying the speech signal as either active or inactive; (b) classifying said active speech into one of a plurality of types of active speech; (c) selecting a coding mode based on whether the speech signal is active or inactive, and if active, based further on said type of active speech; and (d) encoding the speech signal according to said coding mode, forming an encoded speech signal.
2. The method of claim 1, further comprising the step of decoding said encoded speech signal according to said coding mode, forming a synthesized speech signal.
3. The method of claim 1, wherein said coding mode comprises a CELP coding mode, a PPP coding mode, or a NELP coding mode.
4. The method of claim 3, wherein said step of encoding encodes according to said coding mode at a predetermined bit rate associated with said coding mode.
5. The method of claim 4, wherein said CELP coding mode is associated with a bit rate of 8500 bits per second, said PPP coding mode is associated with a bit rate of 3900 bits per second, and said NELP coding mode is associated with a bit rate of 1550 bits per second.
6. The method of claim 3, wherein said coding mode further comprises a zero rate mode.
7. The method of claim 1, wherein said plurality of types of active speech include voiced, unvoiced, and transient active speech.
8. The method of claim 7, wherein said step of selecting a coding mode comprises the steps of: (a) selecting a CELP mode if said speech is classified as active transient speech; (b) selecting a PPP mode if said speech is classified as active voiced speech; and (c) selecting a NELP mode if said speech is classified as inactive speech or active unvoiced speech.
9. The method of claim 8, wherein said encoded speech signal comprises codebook parameters and pitch filter parameters if said CELP mode is selected, codebook parameters and rotational parameters if said PPP mode is selected, or codebook parameters if said NELP mode is selected.
10. The method of claim 1, wherein said step of classifying speech as active or inactive comprises a two energy band based thresholding scheme.
11. The method of claim 1, wherein said step of classifying speech as active or inactive comprises the step of classifying the next M frames as active if the previous N_(ho) frames were classified as active.
12. The method of claim 1, further comprising the step of calculating initial parameters using a “look ahead.”
13. The method of claim 12, wherein said initial parameters comprise LPC coefficients.
14. The method of claim 1, wherein said coding mode comprises a NELP coding mode, wherein the speech signal is represented by a residual signal generated by filtering the speech signal with a Linear Predictive Coding (LPC) analysis filter, and wherein said step of encoding comprises the steps of: (i) estimating the energy of the residual signal, and (ii) selecting a codevector from a first codebook, wherein said codevector approximates said estimated energy; and wherein said step of decoding comprises the steps of: (i) generating a random vector, (ii) retrieving said codevector from a second codebook, (iii) scaling said random vector based on said codevector, such that the energy of said scaled random vector approximates said estimated energy, and (iv) filtering said scaled random vector with an LPC synthesis filter, wherein said filtered scaled random vector forms said synthesized speech signal.
15. The method of claim 14, wherein the speech signal is divided into frames, wherein each of said frames comprises two or more subframes, wherein said step of estimating the energy comprises the step of estimating the energy of the residual signal for each of said subframes, and wherein said codevector comprises a value approximating said estimated energy for each of said subframes.
16. The method of claim 14, wherein said first codebook and said second codebook are stochastic codebooks.
17. The method of claim 14, wherein said first codebook and said second codebook are trained codebooks.
18. The method of claim 14, wherein said random vector comprises a unit variance random vector.
19. A variable rate coding system for coding a speech signal, comprising: classification means for classifying the speech signal as active or inactive, and if active, for classifying the active speech as one of a plurality of types of active speech; and a plurality of encoding means for encoding the speech signal as an encoded speech signal, wherein said encoding means are dynamically selected to encode the speech signal based on whether the speech signal is active or inactive, and if active, based further on said type of active speech.
20. The system of claim 19, further comprising a plurality of decoding means for decoding said encoded speech signal.
21. The system of claim 19, wherein said plurality of encoding means includes a CELP encoding means, a PPP encoding means, and a NELP encoding means.
22. The system of claim 20, wherein said plurality of decoding means includes a CELP decoding means, a PPP decoding means, and a NELP decoding means.
23. The system of claim 21, wherein each of said encoding means encodes at a predetermined bit rate.
24. The system of claim 23, wherein said CELP encoding means encodes at a rate of 8500 bits per second, said PPP encoding means encodes at a rate of 3900 bits per second, and said NELP encoding means encodes at a rate of 1550 bits per second.
25. The system of claim 21, wherein said plurality of encoding means further includes a zero rate encoding means, and wherein said plurality of decoding means further includes a zero rate decoding means.
26. The system of claim 19, wherein said plurality of types of active speech include voiced, unvoiced, and transient active speech.
27. The system of claim 26, wherein said CELP encoder is selected if said speech is classified as active transient speech, wherein said PPP encoder is selected if said speech is classified as active voiced speech, and wherein said NELP encoder is selected if said speech is classified as inactive speech or active unvoiced speech.
28. The system of claim 27, wherein said encoded speech signal comprises codebook parameters and pitch filter parameters if said CELP encoder is selected, codebook parameters and rotational parameters if said PPP encoder is selected, or codebook parameters if said NELP encoder is selected.
29. The system of claim 19, wherein said classification means classifies speech as active or inactive based on a two energy band thresholding scheme.
30. The system of claim 19, wherein said classification means classifies the next M frames as active if the previous N_(ho) frames were classified as active.
31. The system of claim 19, wherein the speech signal is represented by a residual signal generated by filtering the speech signal with a Linear Predictive Coding (LPC) analysis filter, and wherein said plurality of encoding means includes a NELP encoding means comprising: energy estimator means for calculating an estimate of the energy of the residual signal, and encoding codebook means for selecting a codevector from a first codebook, wherein said codevector approximates said estimated energy; and wherein said plurality of decoding means includes a NELP decoding means comprising: random number generator means for generating a random vector, decoding codebook means for retrieving said codevector from a second codebook, multiply means for scaling said random vector based on said codevector, such that the energy of said scaled random vector approximates said estimate, and means for filtering said scaled random vector with an LPC synthesis filter, wherein said filtered scaled random vector forms said synthesized speech signal.
32. The system of claim 19, wherein the speech signal is divided into frames, wherein each of said frames comprises two or more subframes, wherein said energy estimator means calculates an estimate of the energy of the residual signal for each of said subframes, and wherein said codevector comprises a value approximating said subframe estimate for each of said subframes.
33. The system of claim 19, wherein said first codebook and said second codebook are stochastic codebooks.
34. The system of claim 19, wherein said first codebook and said second codebook are trained codebooks.
35. The system of claim 19, wherein said random vector comprises a unit variance random vector.